Techniques for bias correction in sequence data

ABSTRACT

Described herein are various methods of collecting and processing of tumor and/or healthy tissue samples to extract nucleic acid and perform nucleic acid sequencing. Also described herein are various methods of processing nucleic acid sequencing data to remove bias from the nucleic acid sequencing data. Also described herein are various methods of evaluating the quality of nucleic acid sequence information. The identity and/or integrity of nucleic acid sequence data is evaluated prior to using the sequence information for subsequent analysis (for example for diagnostic, prognostic, or clinical purposes). The methods enable a subject, doctor, or user to characterize or classify various types of cancer precisely, and thereby determine a therapy or combination of therapies that may be effective to treat a cancer in a subject based on the precise characterization.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Application No. 62/870,622, filed Jul. 3, 2019, entitled “Compositions and Methods for Sample Preparation and Characterization of Cancer Therefrom” and U.S. Provisional Application No. 62/991,570, filed Mar. 18, 2020, entitled “Nucleic Acid Data Quality Control,” the entire disclosure of each of which is hereby incorporated by reference.

FIELD

Some aspects of the technology described herein relate to collecting and processing of tumor and/or healthy tissue samples to extract nucleic acid and perform nucleic acid sequencing. Some aspects of the technology described herein relate to processing nucleic acid sequencing data to remove bias from the nucleic acid sequencing data. Also described herein are various methods of evaluating the quality of nucleic acid sequence information obtained by sequencing.

BACKGROUND

Correctly characterizing the type or types of cancer a patient or subject has and, potentially, selecting one or more effective therapies for the patient based on the characterization can be crucial for the survival and overall wellbeing of that patient. The manner in which biological samples from a subject are processed to obtain sequence data (e.g., RNA expression data) to characterize the type or types of cancer, and the manner in which the data is processed, may have detrimental effects on the characterization of the cancer or cancers. For example, high throughput nucleic acid sequencing platforms (e.g., next generation sequencing platforms) can generate large amounts of DNA and RNA sequence data from patient samples. Advances in sample preparation, data processing, and evaluation of sequence information from different NGS platforms by custom software for characterizing cancers, predicting prognoses, identifying effective therapies, and otherwise aiding in personalized care of patients with cancer are needed.

SUMMARY

Some embodiments provide for a system comprising: at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing processor executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform a method, the method comprising: obtaining nucleic acid data comprising: sequence data indicating a nucleotide sequence for at least 5 kilobases (kb) of DNA and/or RNA from a previously obtained biological sample of a subject having, suspected of having, or at risk of having a disease; and asserted information indicating an asserted source and/or an asserted integrity of the sequence data; and validating the nucleic acid data by: processing the sequence data to obtain determined information indicating a determined source and/or a determined integrity of the sequence data; and determining whether the determined information matches the asserted information.

Some embodiments provide at least one non-transitory computer-readable storage medium storing processor executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method, the method comprising: obtaining nucleic acid data comprising: sequence data indicating a nucleotide sequence for at least 5 kilobases (kb) of DNA and/or RNA from a previously obtained biological sample of a subject having, suspected of having, or at risk of having a disease; and asserted information indicating an asserted source and/or an asserted integrity of the sequence data; and validating the nucleic acid data by: processing the sequence data to obtain determined information indicating a determined source and/or a determined integrity of the sequence data; and determining whether the determined information matches the asserted information.

Some embodiments provide for a method comprising using at least one computer hardware processor to perform: obtaining nucleic acid data comprising: sequence data indicating a nucleotide sequence for at least 5 kilobases (kb) of DNA and/or RNA from a previously obtained biological sample of a subject having, suspected of having, or at risk of having a disease; and asserted information indicating an asserted source and/or an asserted integrity of the sequence data; and validating the nucleic acid data by: processing the sequence data to obtain determined information indicating a determined source and/or a determined integrity of the sequence data; and determining whether the determined information matches the asserted information.

In some embodiments, the sequence data may include raw DNA or RNA sequence data, DNA exome sequence data (e.g., from whole exome sequencing (WES)), DNA genome sequence data (e.g., from whole genome sequencing (WGS)), RNA expression data, gene expression data, bias-corrected gene expression data, or any other suitable type of sequence data comprising data obtained from a sequencing platform and/or comprising data derived from data obtained from a sequencing platform.

In some embodiments, the method further comprises processing the sequence data to determine whether the sequence data is indicative of one or more disease features when it is determined that the asserted information matches the determined information.

In some embodiments, the method further comprises determining that the determined information matches the asserted information; and processing the sequence data to determine whether it is indicative of one or more disease features.

In some embodiments, the method further comprises generating an indication: that the determined information does not match the asserted information, to not process the sequence data in a subsequent analysis, and/or to obtain additional sequence data and/or other information about the biological sample and/or the subject, when it is determined that the asserted information does not match the determined information.

In some embodiments, the method further comprises: determining that the asserted information does not match the determined information; and generating an indication: that the determined information does not match the asserted information, to not process the sequence data in a subsequent analysis, and/or to obtain additional sequence data and/or other information about the biological sample and/or the subject.
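By way of a non-limiting illustration, the comparison of asserted information with determined information described above may be sketched as follows. The field names (e.g., `molecule`, `polyA_enriched`), the `ValidationResult` type, and the function name are hypothetical and are not taken from this disclosure; the sketch assumes that both kinds of information have already been reduced to simple key/value records.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationResult:
    """Outcome of comparing asserted and determined information (hypothetical type)."""
    matches: bool
    mismatched_fields: list = field(default_factory=list)


def validate_nucleic_acid_data(asserted: dict, determined: dict) -> ValidationResult:
    """Check every asserted field against the value determined from the sequence data.

    A field that is asserted but could not be determined counts as a mismatch.
    """
    mismatched = [
        key for key, value in asserted.items()
        if determined.get(key) != value
    ]
    return ValidationResult(matches=not mismatched, mismatched_fields=mismatched)


# Hypothetical example: the sample is asserted to be polyA-enriched RNA,
# but processing the sequence data suggests total RNA.
asserted = {"molecule": "RNA", "polyA_enriched": True}
determined = {"molecule": "RNA", "polyA_enriched": False}
result = validate_nucleic_acid_data(asserted, determined)
if not result.matches:
    print("Do not process further; mismatched fields:", result.mismatched_fields)
```

In such a sketch, a non-matching result could serve as the indication not to process the sequence data in a subsequent analysis, while a matching result would allow downstream processing to proceed.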

In some embodiments, the asserted information indicates the asserted source of the sequence data, the method further comprising processing the sequence data to obtain determined information indicative of a determined source for the sequence data; and determining whether the determined source matches the asserted source for the sequence data.

In some embodiments, the determined information indicative of the determined source for the sequence data is indicative of an MHC genotype of the subject; whether the nucleic acid data is RNA data or DNA data; a tissue type of the biological sample; a tumor type of the biological sample; a sequencing platform used to generate the sequence data; SNP concordance; and/or whether an RNA sample is polyA enriched.

In some embodiments, the determined information indicative of the determined source for the sequence data is indicative of at least two of: an MHC genotype of the subject; whether the nucleic acid data is RNA data or DNA data; a tissue type of the biological sample; a tumor type of the biological sample; a sequencing platform used to generate the sequence data; SNP concordance; and whether an RNA sample is polyA enriched.

In some embodiments, the determined information indicative of the determined source for the sequence data is indicative of at least three of: an MHC genotype of the subject; whether the nucleic acid data is RNA data or DNA data; a tissue type of the biological sample; a tumor type of the biological sample; a sequencing platform used to generate the sequence data; SNP concordance; and whether an RNA sample is polyA enriched.

In some embodiments, the asserted information indicates the asserted integrity of the sequence data, the method further comprising: processing the sequence data to obtain determined information indicative of a determined integrity of the sequence data; and determining whether the determined integrity matches the asserted integrity for the sequence data.

In some embodiments, the determined information indicative of the determined integrity is indicative of total sequence coverage; exon coverage; chromosomal coverage; a ratio of nucleic acids encoding two or more subunits of a multimeric protein; species contamination; single nucleotide polymorphisms (SNPs); complexity; and/or guanine (G) and cytosine (C) percentage (%) of the sequence data.

In some embodiments, the determined information indicative of the determined integrity is indicative of at least two of: total sequence coverage; exon coverage; chromosomal coverage; a ratio of nucleic acids encoding two or more subunits of a multimeric protein; species contamination; single nucleotide polymorphisms (SNPs); complexity; and guanine (G) and cytosine (C) percentage (%) of the sequence data.

In some embodiments, the determined information indicative of the determined integrity is indicative of at least three of: total sequence coverage; exon coverage; chromosomal coverage; a ratio of nucleic acids encoding two or more subunits of a multimeric protein; species contamination; single nucleotide polymorphisms (SNPs); complexity; and guanine (G) and cytosine (C) percentage (%) of the sequence data.
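As one simple, non-limiting illustration of an integrity measure listed above, the guanine (G) and cytosine (C) percentage of a sequence may be computed as sketched below. The function name and the choice to ignore ambiguous bases (e.g., `N`) are assumptions made for illustration and are not specified by this disclosure.

```python
def gc_percentage(sequence: str) -> float:
    """Return the percentage of G and C among the unambiguous bases (A, C, G, T).

    Ambiguous characters such as 'N' are ignored (an illustrative assumption).
    """
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        raise ValueError("sequence contains no unambiguous bases")
    gc_count = sum(1 for base in called if base in "GC")
    return 100.0 * gc_count / len(called)


# Four of the eight bases are G or C.
print(gc_percentage("GGCCAATT"))
```

A determined GC percentage of this kind could then be compared against an asserted or expected value for the sample type.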

In some embodiments, the asserted information for the sequence data comprises MHC allele information for the subject.

In some embodiments, the method further comprises determining one or more MHC allele sequences from the sequence data and determining whether the one or more MHC allele sequences match the asserted MHC allele information for the subject.

In some embodiments, determining the one or more MHC allele sequences comprises determining MHC allele sequences for six MHC loci from the sequence data.

In some embodiments, wherein the sequence data indicates the nucleotide sequence for RNA, the asserted information indicates whether the RNA is polyA enriched.

In some embodiments, the method further comprises determining, using the sequence data, a therapy for the subject when it is determined that the asserted information matches the determined information.

In some embodiments, determining the therapy comprises: determining a plurality of gene group expression levels, the plurality of gene group expression levels comprising a gene group expression level for each gene group in a set of gene groups, wherein the set of gene groups comprises at least one gene group associated with cancer malignancy, and at least one gene group associated with cancer microenvironment; and identifying the therapy using the determined gene group expression levels.

In some embodiments, the method further comprises administering the therapy to the subject.

In some embodiments, wherein it is determined that the determined information matches the asserted information, the sequence data is processed to determine a therapy for the subject, and the therapy is administered to the subject.

In some embodiments, the disease is cancer, and the therapy is a cancer treatment. In some embodiments, the subject is human.

In some embodiments, processing the sequence data to obtain the determined source comprises determining one or more single nucleotide polymorphisms (SNPs) in the sequence data, and determining whether the one or more SNPs in the sequence data match one or more SNPs in a reference sequence.

In some embodiments, the reference sequence is a sequence of a nucleic acid in a second biological sample of the subject.
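A minimal, non-limiting sketch of one way SNP matching against a reference could be quantified is given below. The representation of SNP calls as a mapping from a site identifier to a genotype string, the function name, and the restriction to sites called in both inputs are assumptions made for illustration only.

```python
def snp_concordance(sample_calls: dict, reference_calls: dict) -> float:
    """Fraction of SNP sites called in both inputs whose genotypes agree.

    Each mapping goes from a site identifier to a genotype string
    (a hypothetical representation chosen for this sketch).
    """
    shared_sites = sample_calls.keys() & reference_calls.keys()
    if not shared_sites:
        raise ValueError("no SNP sites are shared between the two call sets")
    agreeing = sum(
        1 for site in shared_sites
        if sample_calls[site] == reference_calls[site]
    )
    return agreeing / len(shared_sites)


# Hypothetical calls from a tumor RNA sample and a second sample of the same subject.
tumor_calls = {"site1": "AG", "site2": "CC", "site3": "TT"}
second_sample_calls = {"site1": "AG", "site2": "CT", "site3": "TT", "site4": "GG"}
print(snp_concordance(tumor_calls, second_sample_calls))  # two of three shared sites agree
```

A low concordance between two samples asserted to come from the same subject could indicate a sample mix-up; any threshold for "low" would need to be chosen empirically and is not specified here.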

In some embodiments, processing the sequence data to obtain a determined integrity comprises: determining a first level of a first nucleic acid encoding a first subunit of a multimeric protein, determining a second level of a second nucleic acid encoding a second subunit of the multimeric protein, and determining whether a ratio between the first level and the second level matches an expected ratio. In some embodiments, the multimeric protein is a dimer. In some embodiments, the first subunit and the second subunit are first and second CD3 subunits, first and second CD8 subunits, or first and second CD79 subunits.
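The subunit ratio check described above may be sketched, in a non-limiting way, as follows. The default expected ratio of 1.0 and the relative tolerance of 0.5 are placeholder values assumed for illustration; this disclosure does not specify them, and suitable values would depend on the subunits and reference data used.

```python
def subunit_ratio_matches(
    first_level: float,
    second_level: float,
    expected_ratio: float = 1.0,   # placeholder value, assumed for illustration
    tolerance: float = 0.5,        # relative tolerance, assumed for illustration
) -> bool:
    """Return True if first_level / second_level is within a relative
    tolerance of the expected ratio."""
    if second_level <= 0:
        raise ValueError("second_level must be positive")
    observed_ratio = first_level / second_level
    return abs(observed_ratio - expected_ratio) <= tolerance * expected_ratio


# Hypothetical expression levels for two subunits of a dimer.
print(subunit_ratio_matches(10.0, 9.0))   # ratio of about 1.1, within tolerance
print(subunit_ratio_matches(10.0, 2.0))   # ratio of 5.0, outside tolerance
```

An observed ratio far from the expected ratio could be treated as a sign that the integrity of the sequence data is compromised.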

Some embodiments provide for a system for identifying a cancer treatment for a subject having, suspected of having, or at risk of having cancer, the system comprising: at least one sequencing platform configured to generate RNA expression data from enriched RNA obtained from a first biological sample of a first tumor previously obtained from the subject, wherein the enriched RNA was obtained by: (i) extracting RNA from the first biological sample of the first tumor to obtain extracted RNA; and (ii) enriching the extracted RNA for coding RNA to obtain enriched RNA, wherein the RNA expression data comprises at least 5 kilobases (kb); at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining the RNA expression data using the at least one sequencing platform; converting the RNA expression data to gene expression data; determining bias-corrected gene expression data from the gene expression data at least in part by removing, from the gene expression data, expression data for at least one gene that introduces bias in the gene expression data; and identifying a cancer treatment for the subject using the bias-corrected gene expression data.

Some embodiments provide for a system for identifying a cancer treatment for a subject having, suspected of having, or at risk of having cancer, the system comprising: at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining RNA expression data from at least one sequencing platform, the RNA expression data comprising at least 5 kilobases (5 kb), wherein the RNA expression data was obtained, from a first biological sample of a first tumor previously obtained from the subject, at least in part by: (i) extracting RNA from the first biological sample of the first tumor to obtain extracted RNA; and (ii) enriching the extracted RNA for coding RNA to obtain enriched RNA; converting the RNA expression data to gene expression data; determining bias-corrected gene expression data from the gene expression data at least in part by removing, from the gene expression data, expression data for at least one gene that introduces bias in the gene expression data; and identifying a cancer treatment for the subject using the bias-corrected gene expression data. The system may further comprise the at least one sequencing platform in some embodiments.

Some embodiments provide for at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining RNA expression data from at least one sequencing platform, the RNA expression data comprising at least 5 kilobases (5 kb), wherein the RNA expression data was obtained, from a first biological sample of a first tumor previously obtained from a subject having, suspected of having, or at risk of having cancer, at least in part by: (i) extracting RNA from the first biological sample of the first tumor to obtain extracted RNA; and (ii) enriching the extracted RNA for coding RNA to obtain enriched RNA; converting the RNA expression data to gene expression data; determining bias-corrected gene expression data from the gene expression data at least in part by removing, from the gene expression data, expression data for at least one gene that introduces bias in the gene expression data; and identifying a cancer treatment for the subject using the bias-corrected gene expression data.

Some embodiments provide for a method comprising: obtaining a first biological sample of a first tumor, the first biological sample previously obtained from a subject having, suspected of having, or at risk of having cancer; extracting RNA from the first biological sample of the first tumor to obtain extracted RNA; enriching the extracted RNA for coding RNA to obtain enriched RNA; sequencing, using at least one sequencing platform, the enriched RNA to obtain RNA expression data comprising at least 5 kilobases (kb); using at least one computer hardware processor to perform: obtaining the RNA expression data using the at least one sequencing platform; converting the RNA expression data to gene expression data; determining bias-corrected gene expression data from the gene expression data at least in part by removing, from the gene expression data, expression data for at least one gene that introduces bias in the gene expression data; and identifying a cancer treatment for the subject using the bias-corrected gene expression data.

In some embodiments, the method further comprises administering the identified cancer treatment to the subject.

In some embodiments, enriching the RNA for coding RNA comprises performing polyA enrichment.

In some embodiments, the at least one gene that introduces bias in the gene expression data comprises: a gene having an average transcript length that is higher or lower than an average length of transcripts in the gene expression data; a gene having at least a threshold variation in average transcript expression level based on transcript expression levels in reference samples; and/or a gene that has a polyA tail that is at least a threshold amount smaller in length compared to an average length of polyA tails of genes from: the first biological sample from which the RNA expression data was obtained and/or a reference sample.

In some embodiments, the at least one gene that introduces bias in the gene expression data belongs to a family of genes selected from the group consisting of: histone-encoding genes, mitochondrial genes, interleukin-encoding genes, collagen-encoding genes, B-cell receptor-encoding genes, and T cell receptor-encoding genes.

In some embodiments, the at least one gene comprises at least one histone-encoding gene selected from the group consisting of: HIST1H1A, HIST1H1B, HIST1H1C, HIST1H1D, HIST1H1E, HIST1H1T, HIST1H2AA, HIST1H2AB, HIST1H2AC, HIST1H2AD, HIST1H2AE, HIST1H2AG, HIST1H2AH, HIST1H2AI, HIST1H2AJ, HIST1H2AK, HIST1H2AL, HIST1H2AM, HIST1H2BA, HIST1H2BB, HIST1H2BC, HIST1H2BD, HIST1H2BE, HIST1H2BF, HIST1H2BG, HIST1H2BH, HIST1H2BI, HIST1H2BJ, HIST1H2BK, HIST1H2BL, HIST1H2BM, HIST1H2BN, HIST1H2BO, HIST1H3A, HIST1H3B, HIST1H3C, HIST1H3D, HIST1H3E, HIST1H3F, HIST1H3G, HIST1H3H, HIST1H3I, HIST1H3J, HIST1H4A, HIST1H4B, HIST1H4C, HIST1H4D, HIST1H4E, HIST1H4F, HIST1H4G, HIST1H4H, HIST1H4I, HIST1H4J, HIST1H4K, HIST1H4L, HIST2H2AA3, HIST2H2AA4, HIST2H2AB, HIST2H2AC, HIST2H2BE, HIST2H2BF, HIST2H3A, HIST2H3C, HIST2H3D, HIST2H3PS2, HIST2H4A, HIST2H4B, HIST3H2A, HIST3H2BB, HIST3H3, and HIST4H4.

In some embodiments, the at least one gene comprises at least one mitochondrial gene selected from the group consisting of: MT-ATP6, MT-ATP8, MT-CO1, MT-CO2, MT-CO3, MT-CYB, MT-ND1, MT-ND2, MT-ND3, MT-ND4, MT-ND4L, MT-ND5, MT-ND6, MT-RNR1, MT-RNR2, MT-TA, MT-TC, MT-TD, MT-TE, MT-TF, MT-TG, MT-TH, MT-TI, MT-TK, MT-TL1, MT-TL2, MT-TM, MT-TN, MT-TP, MT-TQ, MT-TR, MT-TS1, MT-TS2, MT-TT, MT-TV, MT-TW, MT-TY, MTRNR2L1, MTRNR2L10, MTRNR2L11, MTRNR2L12, MTRNR2L13, MTRNR2L3, MTRNR2L4, MTRNR2L5, MTRNR2L6, MTRNR2L7, and MTRNR2L8.

In some embodiments, determining the bias-corrected gene expression data further comprises: after removing the expression data for the at least one gene that introduces bias in the gene expression data, renormalizing the gene expression data.
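The removal of bias-introducing genes followed by renormalization may be illustrated by the following non-limiting sketch. It assumes that the gene expression data is held as a mapping from gene symbol to a TPM value and that renormalization means rescaling the remaining values to sum to one million; the function name and these representation choices are assumptions made for illustration.

```python
def remove_bias_genes_and_renormalize(tpm: dict, bias_genes: set) -> dict:
    """Drop expression values for bias-introducing genes, then rescale the
    remaining values so that they again sum to one million."""
    kept = {gene: value for gene, value in tpm.items() if gene not in bias_genes}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no expression remains after removing bias genes")
    return {gene: value * 1_000_000 / total for gene, value in kept.items()}


# Hypothetical, unrealistically small example in which a mitochondrial gene
# accounts for half of the total signal.
tpm = {"GAPDH": 400_000.0, "MT-CO1": 500_000.0, "ICAM4": 100_000.0}
corrected = remove_bias_genes_and_renormalize(tpm, bias_genes={"MT-CO1"})
print(corrected)
```

In this example, removing the mitochondrial gene doubles the renormalized values of the remaining genes, which shows why differences in the abundance of such genes between enrichment methods can distort the apparent expression of every other gene.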

In some embodiments, converting the RNA expression data to gene expression data comprises: removing non-coding transcripts from the RNA expression data to obtain filtered RNA expression data; and after removing the non-coding transcripts, normalizing the filtered RNA expression data to obtain gene expression data in transcripts per million (TPM) and/or any other suitable format.

In some embodiments, removing the non-coding transcripts from the RNA expression data comprises removing non-coding transcripts that belong to groups selected from the list consisting of: pseudogenes, polymorphic pseudogenes, processed pseudogenes, transcribed processed pseudogenes, unitary pseudogenes, unprocessed pseudogenes, transcribed unitary pseudogenes, constant chain immunoglobulin (IG C) pseudogenes, joining chain immunoglobulin (IG J) pseudogenes, variable chain immunoglobulin (IG V) pseudogenes, transcribed unprocessed pseudogenes, translated unprocessed pseudogenes, joining chain T cell receptor (TR J) pseudogenes, variable chain T cell receptor (TR V) pseudogenes, small nuclear RNAs (snRNA), small nucleolar RNAs (snoRNA), microRNAs (miRNA), ribozymes, ribosomal RNA (rRNA), mitochondrial tRNAs (Mt tRNA), mitochondrial rRNAs (Mt rRNA), small Cajal body-specific RNAs (scaRNA), retained introns, sense intronic RNA, sense overlapping RNA, nonsense-mediated decay RNA, non-stop decay RNA, antisense RNA, long intervening noncoding RNAs (lincRNA), macro long non-coding RNA (macro lncRNA), processed transcripts, 3prime overlapping non-coding RNA (3prime overlapping ncrna), small RNAs (sRNA), miscellaneous RNA (misc RNA), vault RNA (vaultRNA), and TEC RNA.

In some embodiments, information (e.g., sequence information) for one or more transcripts for one or more of these types of transcripts can be obtained from a nucleic acid database (e.g., a Gencode database, for example Gencode V23, a Genbank database, an EMBL database, or another database).
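The conversion described above (removing non-coding transcripts and then normalizing to TPM) may be sketched, without limitation, as follows. The sketch assumes that per-transcript read counts, transcript lengths in kilobases, and a biotype annotation for each transcript are already available; the function name, the input layout, and the use of a single `protein_coding` label to mark transcripts that are kept are assumptions made for illustration.

```python
def counts_to_tpm(
    counts: dict,
    lengths_kb: dict,
    biotypes: dict,
    coding_biotypes: frozenset = frozenset({"protein_coding"}),  # assumed label
) -> dict:
    """Filter out transcripts whose biotype is not treated as coding, then
    compute transcripts per million (TPM) over the remaining transcripts."""
    kept = [t for t in counts if biotypes.get(t) in coding_biotypes]
    # Reads per kilobase for each retained transcript.
    rates = {t: counts[t] / lengths_kb[t] for t in kept}
    total_rate = sum(rates.values())
    if total_rate <= 0:
        raise ValueError("no coding transcripts with non-zero counts")
    return {t: rate * 1_000_000 / total_rate for t, rate in rates.items()}


# Hypothetical transcripts: t3 is annotated as a lincRNA and is filtered out
# before normalization, so it does not contribute to the TPM denominator.
counts = {"t1": 100, "t2": 300, "t3": 50}
lengths_kb = {"t1": 1.0, "t2": 3.0, "t3": 0.5}
biotypes = {"t1": "protein_coding", "t2": "protein_coding", "t3": "lincRNA"}
print(counts_to_tpm(counts, lengths_kb, biotypes))
```

Filtering before normalization, rather than after, matters because the TPM denominator is computed only over the transcripts that remain.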

In some embodiments, the method further comprises, prior to performing the removal of the non-coding transcripts, aligning the RNA expression data to a reference; and annotating the RNA expression data.

In some embodiments, the RNA expression data comprises at least 25 million paired-end reads. In some embodiments, the RNA expression data comprises at least 50 million paired-end reads, with an average read length of at least 100 bp.

In some embodiments, identifying the cancer treatment for the subject using the bias-corrected gene expression data comprises: determining, using the bias-corrected gene expression data, a plurality of gene group expression levels, the plurality of gene group expression levels comprising a gene group expression level for each gene group in a set of gene groups, wherein the set of gene groups comprises at least one gene group associated with cancer malignancy, and at least one gene group associated with cancer microenvironment; and identifying the cancer treatment using the determined gene group expression levels.

In some embodiments, the cancer treatment is selected from the group consisting of a radiation therapy, a surgical therapy, a chemotherapy, and an immunotherapy.

In some embodiments, the method further comprises obtaining a second biological sample of a second tumor, the second biological sample previously obtained from the subject.

In some embodiments, the method further comprises combining the first biological sample and the second biological sample to form a combined tumor sample, and extracting the RNA comprises extracting the RNA from the combined tumor sample.

In some embodiments, the method further comprises extracting RNA from the second biological sample; and combining the RNA extracted from the second biological sample with the RNA extracted from the first biological sample to form combined extracted RNA, and enriching the RNA for coding RNA comprises enriching the combined extracted RNA for coding RNA.

In some embodiments, the extracted RNA comprises at least 1 μg of RNA upon RNA extraction.

In some embodiments, the extracted RNA is at least 1000-6000 ng in total mass, and has a purity corresponding to a ratio of absorbance at 260 nm to absorbance at 280 nm of at least 2.0.
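For illustration only, the mass and purity criteria above may be expressed as a simple check. The function and parameter names are hypothetical; the default thresholds of 1000 ng and an A260/A280 ratio of 2.0 are taken from the embodiments above, under the assumption that the lower end of the stated mass range is the minimum.

```python
def rna_extract_passes_qc(
    mass_ng: float,
    a260: float,
    a280: float,
    min_mass_ng: float = 1000.0,  # 1 microgram, per the embodiments above
    min_ratio: float = 2.0,       # A260/A280 purity threshold, per the embodiments above
) -> bool:
    """Return True if the extracted RNA meets both the minimum total mass
    and the minimum A260/A280 absorbance ratio."""
    if a280 <= 0:
        raise ValueError("a280 must be positive")
    return mass_ng >= min_mass_ng and (a260 / a280) >= min_ratio


# Hypothetical measurements for three extractions.
print(rna_extract_passes_qc(mass_ng=1500.0, a260=2.1, a280=1.0))  # meets both criteria
print(rna_extract_passes_qc(mass_ng=800.0, a260=2.1, a280=1.0))   # too little RNA
print(rna_extract_passes_qc(mass_ng=1500.0, a260=1.8, a280=1.0))  # purity too low
```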

In some embodiments, the method further comprises performing quality control assessment on the RNA expression data at least in part by: obtaining asserted information indicating an asserted source and/or an asserted integrity of the RNA expression data; processing the RNA expression data to obtain determined information indicating a determined source and/or a determined integrity of the RNA expression data; and determining whether the determined information matches the asserted information.

In some embodiments, processing the RNA expression data comprises processing the RNA expression data to determine: a tissue type of the first biological sample; a tumor type of the first biological sample; and/or guanine (G) and/or cytosine (C) percentage (%).

BRIEF DESCRIPTION OF THE DRAWINGS

The following drawings form part of the present specification and are included to further demonstrate certain aspects of the present disclosure, which can be better understood by reference to one or more of these drawings in combination with the detailed description of specific embodiments presented herein. The drawings are not necessarily drawn to scale.

FIG. 1A and FIG. 1B provide exemplary flow charts that illustrate processes of sample preparation and quality control. FIG. 1A provides an example of a process pipeline that includes one or more quality control assessments during biopsy sample collection, DNA/RNA extraction and library construction, and/or nucleic acid bioinformatic analysis. FIG. 1B provides an example of a process for obtaining a biopsy sample of a subject, extracting nucleic acid from the sample, sequencing the nucleic acid, and processing the nucleic acid sequence to identify one or more cancer therapies appropriate for the subject.

FIGS. 2A and 2B provide graphical representations of the levels and distribution of RNA transcripts depending on the type of RNA enrichment methods used and whether stranded or non-stranded RNA was used for sequencing. FIG. 2A provides a graphical representation of distribution of RNA after RNA enrichment by either depletion of ribosomal RNA (r-RNA) or by poly A enrichment. FIG. 2B provides a graphical representation of levels of RNA measured after RNA sequencing of either stranded or non-stranded RNA for IL24, ICAM4, and GAPDH.

FIG. 3 is a graphical representation of the distribution of different RNA transcripts for different types of RNA as shown in the legend. Each column represents a unique sample. All samples were prepared from the same tissue type, using the same RNA enrichment method and the same sequencing service. The bottom panel shows the data in the top panel in Transcripts per Kilobase Million (TPM).

FIG. 4A shows the distribution of poly-A tails of RNA transcripts from HeLa cell samples, and several examples of poly-A tails for histone family genes.

FIG. 4B shows a comparison of expression of mitochondrial RNA for samples that are either poly-A-enriched or enriched by rRNA depletion (denoted by total RNA).

FIG. 4C shows a comparison of expression of histone-coding RNA for samples that are either poly-A-enriched or enriched by rRNA depletion (denoted by total RNA).

FIG. 5 shows a principal component analysis (PCA) of RNA expression of cell samples containing different percentages of HeLa cells, either with or without polyA enrichment, and with or without data filtration. Data filtration included removal of non-coding RNA transcripts, histone-coding transcripts, and mitochondrial transcripts. The PCA0 component describes major differences between poly-A and total-RNA sequencing. The PCA1 component describes different cell line ratios. Samples were prepared as a mixture of two different cell lines in 5 different ratios.

FIG. 6A is an exemplary flowchart that illustrates a process 200 for obtaining enriched RNA sequence data from a tumor of a subject having, suspected of having, or at risk of having cancer.

FIG. 6B is an exemplary flowchart that illustrates a process 210 for obtaining bias-corrected gene expression data from RNA expression data to identify a cancer treatment for a subject having, suspected of having, or at risk of having cancer.

FIG. 6C is an exemplary flowchart that illustrates a process 220 for processing RNA obtained from a tumor sample to identify a cancer treatment for a subject having, suspected of having, or at risk of having cancer.

FIG. 7 is a flowchart of an illustrative process pipeline 300 comprising bioinformatic quality control processes for assessing nucleic acid sequence data obtained from a tumor sample and using the nucleic acid sequence data to identify a cancer treatment for a subject having, suspected of having, or at risk of having cancer.

FIG. 8 is an exemplary flowchart that illustrates a process 800 showing computerized processes for processing and validating sequence data and related information.

FIG. 9 is a block diagram of an illustrative computer system 500 that may be used to implement one or more embodiments of a process pipeline for preparing, assessing, and/or analyzing sequence data.

FIG. 10 is a block diagram of an illustrative environment 600 in which one or more embodiments of the technology described in this application may be implemented.

FIG. 11 shows the results of MHC allele analysis for sequence information obtained from three nucleic acids (RNA-Seq, WES Tumor, and WES Normal) for two subjects (103 and 105).

FIG. 12 shows an example of a bar graph representing the probability of sequence information being from a particular type of tumor (e.g., BRCA related to breast cancer).

FIGS. 13A-13B show graphs representing an example of the relationship between protein subunit expression levels.

FIGS. 14A-14B show examples of bar graphs representing the probability that sequence information was obtained from samples that contained only polyadenylated RNA or from samples that contained all RNA (total RNA).

FIG. 15 shows an example of a principal component analysis and illustrates the analysis of three batches of gene expression data including tumor and normal samples.

DETAILED DESCRIPTION

Recent advances in personalized genomic sequencing and cancer genomic sequencing technologies have made it possible to obtain patient-specific information about cancer cells (e.g., tumor cells) and cancer microenvironments from one or more biological samples obtained from individual patients. The inventors have appreciated that this information may be used to characterize the type(s) of cancer a patient has and, potentially, select one or more effective therapies for the patient. This information may also be used to determine how a patient is responding over time to a treatment and, if necessary, to select a new therapy or therapies for the patient. This information may also be used to determine whether a patient should be included in or excluded from participating in a clinical trial.

The inventors have recognized that the workflow used to obtain sequence data for a patient strongly influences the inferences that can be drawn about the patient's cancer. Such inferences include, but are not limited to, determining whether the patient will respond to a particular therapy or therapies, whether the patient will have an adverse reaction to a particular therapy or therapies, whether the patient is a candidate for enrollment in a clinical trial, whether the patient has one or more particular biomarkers (e.g., biomarkers indicative of potential response to a therapy, biomarkers indicative of survival, etc.), whether the patient's disease has progressed (e.g., from an earlier stage cancer to a later stage cancer, relapsed from remission, etc.), whether a different therapy or therapies should be selected for the patient, and/or any other suitable prognostic, diagnostic, and/or clinical inferences.

When the workflow used to obtain sequence data contains errors, sub-optimal processing, sources of bias in the data, and the like, it is often not possible to make inferences about the subject's cancer with the desired or necessary confidence, or even to make any such inferences at all. Even worse, errors in the workflow for producing sequence data may result in incorrect inferences about the patient, potentially leading to incorrect treatment or missed opportunities for better treatment. Moreover, workflow errors lead to wasted resources in the laboratory (e.g., having to reprocess samples) and wasted computing resources (e.g., performing expensive computational processing on megabytes and gigabytes of sequence data, taking up processor and networking resources, only to discard the results at a later time and/or have to repeat the processing).

A conventional workflow used to obtain sequence data for a patient includes multiple steps, including: obtaining a biological sample from the patient (e.g., by performing a biopsy, or by obtaining a blood sample, a salivary sample, or any other suitable biological sample from the patient), preparing the biological sample for sequencing using a sequencing platform (e.g., a next generation sequencing (NGS) platform), and obtaining raw data output by the sequencing platform. Various conventional bioinformatics processing pipelines and other algorithms may then use the raw data output by the sequencing platform in an attempt to make one or more of the above-described inferences.

However, such conventional workflows for obtaining sequencing data are prone to errors at all stages. For example, errors may be made in a laboratory when handling samples for multiple patients. Indeed, it is not uncommon for a laboratory to receive a biological sample asserted to be from one patient, when that sample is from another patient. As another example, a biological sample may not be processed properly by the laboratory and may not have the concentration and/or quality of nucleic acid needed for subsequent analysis. As yet another example, errors may be introduced by the sequencing platform itself and/or subsequent post-processing steps (e.g., alignment and variant calling). As yet another example, raw sequencing data produced by a sequencing platform may contain artefacts and undesired sequences and/or transcripts. Other examples of various errors are described herein.

In some embodiments, sequence data or sequencing data may include raw DNA or RNA sequence data, DNA exome sequence data (e.g., from whole exome sequencing (WES)), DNA genome sequence data (e.g., from whole genome sequencing (WGS)), RNA expression data, gene expression data, bias-corrected gene expression data, or any other suitable type of sequence data comprising data obtained from a sequencing platform and/or comprising data derived from data obtained from a sequencing platform.

To address shortcomings of conventional workflows for obtaining sequencing data for patients, the inventors have developed techniques that address various sources of error that may be present in sequencing data. These techniques developed by the inventors include: (1) novel sample preparation techniques to prepare biological samples for sequencing using one or multiple sequencing platforms; (2) novel techniques for post-processing raw data output by the sequencing platform(s) to filter out irrelevant data and sources of bias (e.g., transcripts for non-coding regions and expression data associated with genes that introduce bias in the sequence data); and (3) novel quality control techniques that facilitate the detection and remediation of errors in the sequence data. In some embodiments, techniques from each of these three categories may be utilized in a workflow to obtain sequence data for a patient, though it should be appreciated that this is not a limitation of the techniques described herein, and that, in some embodiments, any one or more of the techniques (but not necessarily all of them) may be used in a workflow.

As one example, in some embodiments, novel sample preparation techniques and post-processing techniques include obtaining sequencing data and removing sources of bias from the sequencing data by: (1) obtaining a first biological sample of a first tumor, the first biological sample previously obtained from a subject having, suspected of having, or at risk of having cancer; (2) extracting RNA from the first biological sample of the first tumor to obtain extracted RNA; (3) enriching the extracted RNA for coding RNA to obtain enriched RNA; (4) sequencing, using at least one sequencing platform, the enriched RNA to obtain RNA expression data comprising at least 5 kilobases (kb); and (5) using at least one computer hardware processor to perform: (a) obtaining the RNA expression data using the at least one sequencing platform; (b) converting the RNA expression data to gene expression data; (c) determining bias-corrected gene expression data from the gene expression data at least in part by removing, from the gene expression data, expression data for at least one gene that introduces bias in the gene expression data; and (d) identifying a cancer treatment for the subject using the bias-corrected gene expression data.

Removing bias from the gene expression data in this way provides an improvement to sequencing technology for numerous reasons. First, it removes artefacts and sources of bias from sequencing data, resulting in fewer errors in any downstream processing and a higher fidelity output. Second, the inventors have recognized that removing sources of bias in this way allows for more accurately and faithfully representing a patient's molecular functional characteristics (e.g., via molecular functional expression signatures described herein). The inventors have recognized that the bias-corrected gene expression data may be used to identify more effective therapies for a patient, improve the ability to determine whether one or more cancer therapies will be effective if administered to the patient, improve the ability to identify clinical trials in which the subject may participate, and/or improve numerous other prognostic, diagnostic, and clinical applications.

As another example, in some embodiments, novel quality control techniques include using at least one computer hardware processor to perform: (a) obtaining nucleic acid data comprising: (i) sequence data indicating a nucleotide sequence for at least 5 kilobases (kb) of DNA and/or RNA from a previously obtained biological sample of a subject having, suspected of having, or at risk of having a disease; and (ii) asserted information indicating an asserted source and/or an asserted integrity of the sequence data; and (b) validating the nucleic acid data by: (i) processing the sequence data to obtain determined information indicating a determined source and/or a determined integrity of the sequence data; and (ii) determining whether the determined information matches the asserted information. Examples of various such validation techniques are described herein, and they are important examples of quality control techniques developed by the inventors and described herein.
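As a non-limiting illustration, the comparison in step (b)(ii) between asserted information and determined information may be sketched as follows. The record type `SampleInfo`, its field names, and the function `find_mismatches` are hypothetical names introduced here for illustration only; how the determined values are inferred from the sequence data is outside the scope of this sketch.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class SampleInfo:
    """Hypothetical record describing the source/integrity of sequence data."""
    subject_id: str      # which subject the data is said (or found) to come from
    tissue_type: str     # e.g., "tumor" or "normal"
    rna_selection: str   # e.g., "polyA" or "total"


def find_mismatches(asserted: SampleInfo, determined: SampleInfo) -> List[str]:
    """Return the names of fields where determined info differs from asserted info."""
    return [
        f.name
        for f in fields(SampleInfo)
        if getattr(asserted, f.name) != getattr(determined, f.name)
    ]


# Illustrative values only (assumed, not taken from any real sample).
asserted = SampleInfo(subject_id="S-001", tissue_type="tumor", rna_selection="polyA")
determined = SampleInfo(subject_id="S-001", tissue_type="tumor", rna_selection="total")

mismatches = find_mismatches(asserted, determined)
print(mismatches)  # prints ['rna_selection']
```

In this sketch, an empty list would indicate that the determined information matches the asserted information, and a non-empty list identifies which asserted attributes warrant follow-up.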

Employing such quality control techniques also provides an improvement to sequencing technology and computer technology. First, sequencing data that do not pass one or more quality control checks are not used for some or all of the downstream processing, reducing or eliminating errors in downstream applications (e.g., identifying biomarkers, tumor microenvironment types, possible therapies for a patient, etc.). Often such downstream processing requires performing expensive (frequently cloud-based) computational processing of large data sets (e.g., sequencing data may contain tens of millions of reads, which have to be aligned, annotated, and processed in numerous other ways). Using quality control to prevent computationally expensive processes from executing will reduce or eliminate wasteful use of computing resources, saving processing power, memory, and networking resources (which is an improvement to computing technology in addition to being an improvement to sequencing technology). Identifying errors will also reduce waste of resources at a laboratory that processes multiple samples, by freeing up equipment for processing biological samples that have passed initial quality control checks. In addition, sequence data that has passed various quality control checks may be used in downstream processing to identify more effective therapies for a patient, improve the ability to determine whether one or more cancer therapies will be effective if administered to the patient, improve the ability to identify clinical trials in which the subject may participate, and/or improve numerous other prognostic, diagnostic, and clinical applications.
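As a non-limiting illustration of how a quality control check may prevent computationally expensive processing from executing, the following minimal sketch runs a sequence of steps and stops at the first failed quality control gate. The function name `run_with_qc_gates`, the step names, and the toy read-count threshold are assumptions made for illustration and do not represent a specific implementation described herein.

```python
def run_with_qc_gates(data, steps):
    """Run steps in order; stop at the first failed quality control gate.

    `steps` is a list of (name, function, is_gate) tuples. A gate function
    returns True/False; a processing function returns the transformed data.
    """
    executed = []
    for name, func, is_gate in steps:
        executed.append(name)
        if is_gate:
            if not func(data):
                # Later (possibly expensive) steps are never invoked.
                return {"status": "failed_qc", "failed_at": name, "executed": executed}
        else:
            data = func(data)
    return {"status": "ok", "result": data, "executed": executed}


# Toy example: the gate requires a minimum read count (threshold is assumed).
def has_enough_reads(d):
    return d["n_reads"] >= 1_000_000


def expensive_alignment(d):
    # Stand-in for a costly downstream step such as read alignment.
    return {**d, "aligned": True}


steps = [
    ("read_count_qc", has_enough_reads, True),
    ("alignment", expensive_alignment, False),
]

outcome = run_with_qc_gates({"n_reads": 50_000}, steps)
print(outcome["status"], outcome["executed"])  # prints: failed_qc ['read_count_qc']
```

Because the gate is evaluated before the processing step, data that fail the check consume no downstream computing resources in this sketch.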

FIGS. 1A and 1B illustrate examples of process pipelines for sample preparation and quality control as described herein. The process pipelines in FIGS. 1A and 1B illustrate embodiments of methods and systems provided in the present disclosure and are not to be construed in any way as limiting their scope. A process pipeline does not need to include all of the process steps, or follow the order of process steps, illustrated in FIGS. 1A and 1B. One or more processes can be omitted, repeated, or performed in a different order depending on the application.

FIG. 1A illustrates a non-limiting process pipeline 100 that includes one or more quality control assessments. A biological sample (e.g., a tumor biopsy) is obtained for a subject (e.g., a subject having, suspected of having, or at risk of having cancer) in act 101. In some embodiments, the sample is obtained from a physician, hospital, clinic, or other healthcare provider. One or more sample quality control assessments at quality control act 102 can be performed on the biological sample. In some embodiments, a quality control assessment on the biological sample (e.g., biopsy material) comprises determining whether the sample is in an appropriate form (e.g., fresh frozen or FFPE) and/or is accompanied by sufficient information to identify the nature and source of the sample. Subsequently, nucleic acid (e.g., DNA and/or RNA) can be extracted from a biological sample that satisfies sample quality control act 102. One or more nucleic acid quality control assessments at act 103 then can be performed, for example to evaluate one or more physical attributes of the extracted nucleic acid, of a nucleic acid library prepared from the extracted nucleic acid, and/or of pooled nucleic acids or libraries. Subsequently, nucleic acid (e.g., DNA and/or RNA) that satisfies nucleic acid quality control act 103 can be processed (e.g., to enrich for polyA RNA) and/or sequenced to obtain raw DNA and/or RNA sequence data (e.g., RNA expression data). In some embodiments, RNA expression data can be processed to obtain gene expression data and optionally to remove data for one or more types of genes that could interfere with (e.g., bias) subsequent analysis of the gene expression data. In some embodiments, gene expression data is normalized (e.g., after removal of the data for the one or more interfering genes). In some embodiments, one or more sequence quality control assessments are performed on DNA and/or RNA sequence data (e.g., on processed, for example normalized, gene expression data) for bioinformatic quality control act 104. In some embodiments, one or more bioinformatic quality control assessments are performed to determine whether sequence data is from an expected source (e.g., patient, tissue, tumor, etc.) and/or whether it has sufficient integrity for further analysis. In some embodiments, sequence data that satisfies bioinformatic quality control act 104 is further processed, for example, to determine a diagnosis, prognosis, and/or therapy for a subject, to evaluate and/or monitor a subject, and/or for one or more clinical applications (e.g., to evaluate a therapy).

In some embodiments, the sequence data may include raw DNA or RNA sequence data, DNA exome sequence data (e.g., from whole exome sequencing (WES)), DNA genome sequence data (e.g., from whole genome sequencing (WGS)), RNA expression data, gene expression data, bias-corrected gene expression data, or any other suitable type of sequence data comprising data obtained from a sequencing platform and/or comprising data derived from data obtained from a sequencing platform including, but not limited to, examples of such data described herein.

FIG. 1B illustrates a non-limiting process pipeline 110 for preparing nucleic acid from a biological sample (e.g., a tumor biopsy) and obtaining and processing nucleic acid sequence data for subsequent analysis (e.g., for diagnostic, prognostic, therapeutic, and/or other clinical applications). Process pipeline 110 is performed by obtaining a biological sample (e.g., a tumor sample) from a subject having, suspected of having, or at risk of having cancer at act 111. Nucleic acid (e.g., DNA and/or RNA) is obtained (e.g., extracted) from the sample at act 112. One or more quality control assessments of the nucleic acid are performed in act 113. One or more nucleic acid libraries are prepared in act 114, for example using nucleic acid that satisfies at least one quality control assessment of act 113. The nucleic acid libraries are sequenced using at least one sequencing platform in sequencing act 115 (e.g., to obtain RNA expression data for RNA). In some embodiments, RNA expression data is converted to gene expression data in act 116, and the gene expression data is optionally bias-corrected, at least in part, by removing expression data for at least one gene that introduces bias in the gene expression data. One or more bioinformatic quality control assessments are performed on the DNA sequence data and/or RNA sequence data from act 115 and/or on the gene expression data from act 116 (e.g., the bias-corrected gene expression data) in bioinformatics quality control act 117. In some embodiments, nucleic acid data (e.g., data that satisfies at least one bioinformatic quality control assessment of act 117) is further processed in act 118 (e.g., to determine one or more indicia of disease from the gene expression data) to perform a diagnostic, prognostic, therapeutic, and/or other clinical assessment of the subject (e.g., to identify a treatment, for example a cancer treatment, for the subject) in act 119. In some embodiments, a treatment (e.g., a cancer treatment) is administered to the subject.

In some embodiments, act 111 comprises obtaining bulk biopsy tissues of a subject or a patient. In some embodiments, act 111 comprises obtaining a blood sample of a subject or a patient. In some embodiments, act 111 comprises obtaining a single cell suspension. In some embodiments, act 111 comprises obtaining any type of sample that is suitable for preparing nucleic acids for subsequent sequencing analysis. In some embodiments, act 111 comprises obtaining more than one type of sample.

In some embodiments, when the bulk biopsy tissues are obtained, the tissues are processed (e.g., homogenized in the presence of TriZol) to extract nucleic acids such as DNA or RNA at act 112. In some embodiments, when a single cell suspension is obtained, the suspension is processed to extract nucleic acids such as DNA or RNA at act 112. In some embodiments, nucleic acids can be extracted that are suitable for germline whole exome sequencing (WES) at act 112. In some embodiments, nucleic acids can be extracted that are suitable for tumor whole exome sequencing (WES) at act 112. In some embodiments, nucleic acids can be extracted that are suitable for tumor RNA sequencing at act 112. In some embodiments, nucleic acids can be extracted that are suitable for CYTOF (mass cytometry) at act 112. In some embodiments, nucleic acids can be extracted that are suitable for any type of sequencing known in the art at act 112.

At act 113, one or more quality control assessments can be performed. Acceptable and/or target thresholds can be determined and used as references. In some embodiments, the total amount of extracted DNA or RNA can be used for quality control assessment. In some embodiments, a spectrophotometer, for example a small volume full-spectrum, UV-visible spectrophotometer (e.g., NanoDrop spectrophotometer available from ThermoFisher Scientific, www.thermofischer.com) can be used for quality control assessment of DNA or RNA. In some embodiments, a fluorometer, for example for quantification of DNA or RNA (e.g., a Qubit fluorometer available from ThermoFisher Scientific, www.thermofischer.com) can be used for quality control assessment of DNA or RNA. In some embodiments, an automated electrophoresis system (e.g., TAPESTATION) can be used for quality control assessment of DNA or RNA. In some embodiments, a real-time PCR system (e.g., LIGHTCYCLER®) can be used for quality control assessment of DNA or RNA.
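As a non-limiting illustration of comparing measurements against acceptable thresholds at act 113, the following minimal sketch reports which metrics fall below their minimum acceptable values. The metric names and the threshold values shown are placeholders assumed for illustration; they are not thresholds specified herein.

```python
def failed_qc_metrics(measurements, minimum_thresholds):
    """Return the names of metrics that are missing or below their minimum."""
    failed = []
    for metric, minimum in minimum_thresholds.items():
        value = measurements.get(metric)
        if value is None or value < minimum:
            failed.append(metric)
    return failed


# Placeholder thresholds (assumed for illustration only).
minimum_thresholds = {
    "total_rna_ug": 2.0,      # total amount of extracted RNA, in micrograms
    "a260_a280_ratio": 1.8,   # absorbance ratio from a spectrophotometer
}

# Illustrative measurements for one extracted sample.
measurements = {"total_rna_ug": 2.4, "a260_a280_ratio": 1.6}

print(failed_qc_metrics(measurements, minimum_thresholds))  # prints ['a260_a280_ratio']
```

A sample for which the returned list is empty would satisfy every assessed threshold in this sketch, and its nucleic acid could proceed to library preparation at act 114.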

In some embodiments, act 114 comprises preparing libraries for the extracted nucleic acids that have satisfied at least one quality control threshold at act 113. In some embodiments, act 114 comprises one or more methods described in Example 2.

In some embodiments, act 115 comprises sequencing nucleic acid (e.g., the DNA, RNA, or related libraries of act 114) to obtain DNA sequence data and/or RNA sequence data (e.g., RNA expression data) using at least one nucleic acid sequencing platform (e.g., a next generation nucleic acid sequencing platform). Sequence data obtained at act 115 can be stored in any suitable format (e.g., in the form of one or more FASTQ files).

In some embodiments, RNA expression data is converted to gene expression data at act 116. In some embodiments, RNA expression data is aligned to known genes in a database, for example to a known assembled genome (e.g., a human genome) or to a transcriptome in the database. In some embodiments, a program for quantifying transcripts, for example from bulk and single-cell RNA-Seq data, using high-throughput sequencing reads (e.g., Kallisto (hg38) available from Github, www.github.com, for example as described in Nicolas L Bray, Harold Pimentel, Pall Melsted and Lior Pachter, Near-optimal probabilistic RNA-seq quantification, Nature Biotechnology 34, 525-527 (2016), doi:10.1038/nbt.3519), and/or Gencode (e.g., Gencode V23) is used for sequence alignment and/or annotation. In some embodiments, act 116 comprises gene aggregation. In some embodiments, act 116 comprises removing expression data for one or more non-coding transcripts from the gene expression data. In some embodiments, act 116 comprises removing expression data for one or more genes that can bias the gene expression data. In some embodiments, act 116 comprises removing expression data for histone-encoding genes and/or mitochondrial-encoding genes. In some embodiments, act 116 comprises normalization (e.g., TPM normalization) after removal of expression data for non-coding and/or bias-associated genes from the gene expression data. This normalization may be termed “renormalization” herein.
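As a non-limiting illustration of act 116, the following minimal sketch aggregates transcript-level TPM values to gene level, removes expression data for a set of excluded genes, and renormalizes the remaining values so that they again sum to one million. The function name `bias_corrected_tpm`, the transcript identifiers, the toy expression values, and the particular excluded gene shown are assumptions made for illustration; the sketch operates on already-quantified TPM values and does not invoke any quantification program.

```python
from collections import defaultdict


def bias_corrected_tpm(transcript_tpm, transcript_to_gene, excluded_genes):
    """Aggregate transcript TPM to genes, drop excluded genes, renormalize to 1e6."""
    # Gene aggregation: sum TPM over all transcripts annotated to the same gene.
    gene_tpm = defaultdict(float)
    for transcript_id, tpm in transcript_tpm.items():
        gene = transcript_to_gene.get(transcript_id)
        if gene is None:
            continue  # unannotated transcripts are skipped in this sketch
        gene_tpm[gene] += tpm

    # Removal of expression data for genes assumed to introduce bias.
    kept = {g: v for g, v in gene_tpm.items() if g not in excluded_genes}

    # Renormalization: rescale so the retained values sum to one million.
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no expression remains after filtering")
    scale = 1_000_000 / total
    return {g: v * scale for g, v in kept.items()}


# Toy inputs (hypothetical transcript identifiers and values).
transcript_tpm = {"tx1": 300_000.0, "tx2": 100_000.0, "tx3": 400_000.0, "tx4": 200_000.0}
transcript_to_gene = {"tx1": "TP53", "tx2": "TP53", "tx3": "MT-CO1", "tx4": "GAPDH"}
excluded_genes = {"MT-CO1"}  # example of a mitochondrially encoded gene

corrected = bias_corrected_tpm(transcript_tpm, transcript_to_gene, excluded_genes)
for gene, value in sorted(corrected.items()):
    print(gene, round(value, 2))
```

Renormalizing after removal preserves the relative expression of the retained genes (in the toy data, TP53 remains twice GAPDH) while restoring a common per-sample total, so that samples with different proportions of removed transcripts remain comparable.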

At act 117, one or more bioinformatic quality control assessments are performed on nucleic acid sequence data, for example DNA sequence data and/or RNA sequence data (e.g., bias-corrected and/or normalized gene expression data). In some embodiments, one or more bioinformatic quality control assessments can be performed to evaluate the source and/or integrity of the nucleic acid sequence data. In some embodiments, one or more bioinformatic quality control assessments described in this application are performed.

In some embodiments, a method comprises all processes illustrated in FIG. 1A or FIG. 1B. However, in some embodiments, a subset of the processes is performed, and any one or more of the processes may be omitted, duplicated, and/or performed in a different order than illustrated in FIGS. 1A and 1B. In some embodiments, a method comprises a process, optionally including one or more quality control steps, for preparing a nucleic acid from a biological sample, wherein the nucleic acid is sequenced on at least one sequencing platform. In some embodiments, a method comprises processing nucleic acid information obtained (e.g., received) from a sequencing platform to generate DNA or RNA sequence data for subsequent analysis (e.g., to generate bias-corrected, optionally normalized gene expression data for subsequent analysis). In some embodiments, one or more processes of FIGS. 1A and 1B are implemented on a computer. In some embodiments, a method comprises identifying a treatment (e.g., a cancer treatment) for a subject (e.g., a subject having, suspected of having, or at risk of having cancer). In some embodiments, a method comprises administering the treatment to the subject.

Biological Samples

Any of the methods, systems, or other claimed elements may use or be used to analyze a biological sample from a subject. In some embodiments, a biological sample is obtained from a subject having or suspected of having cancer. One or more biological samples from a subject may be analyzed as described herein to obtain information about the subject's cancer. The biological sample may be any type of biological sample including, for example, a biological sample of a bodily fluid (e.g., blood, urine, or cerebrospinal fluid), one or more cells (e.g., from a scraping or brushing such as a cheek swab or tracheal brushing), a piece of tissue (e.g., cheek tissue, muscle tissue, lung tissue, heart tissue, brain tissue, or skin tissue), or some or all of an organ (e.g., brain, lung, liver, bladder, kidney, pancreas, intestines, or muscle), or other types of biological samples (e.g., feces or hair).

In some embodiments, the biological sample is a sample of a tumor from a subject. In some embodiments, the biological sample is a sample of blood from a subject. In some embodiments, the biological sample is a sample of tissue from a subject.

A sample of a tumor, in some embodiments, refers to a sample comprising cells from a tumor. In some embodiments, the sample of the tumor comprises cells from a benign tumor, e.g., non-cancerous cells. In some embodiments, the sample of the tumor comprises cells from a premalignant tumor, e.g., precancerous cells. In some embodiments, the sample of the tumor comprises cells from a malignant tumor, e.g., cancerous cells.

Examples of tumors include, but are not limited to, adenomas, fibromas, hemangiomas, lipomas, cervical dysplasia, metaplasia of the lung, leukoplakia, carcinoma, sarcoma, germ cell tumors, and blastoma.

A sample of blood, in some embodiments, refers to a sample comprising cells, e.g., cells from a blood sample. In some embodiments, the sample of blood comprises non-cancerous cells. In some embodiments, the sample of blood comprises precancerous cells. In some embodiments, the sample of blood comprises cancerous cells. In some embodiments, the sample of blood comprises blood cells. In some embodiments, the sample of blood comprises red blood cells. In some embodiments, the sample of blood comprises white blood cells. In some embodiments, the sample of blood comprises platelets. Examples of cancerous blood cells include, but are not limited to, leukemia, lymphoma, and myeloma. In some embodiments, a sample of blood is collected to obtain the cell-free nucleic acid (e.g., cell-free DNA) in the blood.

A sample of blood may be a sample of whole blood or a sample of fractionated blood. In some embodiments, the sample of blood comprises whole blood. In some embodiments, the sample of blood comprises fractionated blood. In some embodiments, the sample of blood comprises buffy coat. In some embodiments, the sample of blood comprises serum. In some embodiments, the sample of blood comprises plasma. In some embodiments, the sample of blood comprises a blood clot.

A sample of a tissue, in some embodiments, refers to a sample comprising cells from a tissue. In some embodiments, the sample of the tissue comprises non-cancerous cells from a tissue. In some embodiments, the sample of the tissue comprises precancerous cells from a tissue. In some embodiments, the sample of the tissue comprises cancerous cells from a tissue.

Methods of the present disclosure encompass a variety of tissues, including organ tissue or non-organ tissue, including but not limited to muscle tissue, brain tissue, lung tissue, liver tissue, epithelial tissue, connective tissue, and nervous tissue. In some embodiments, the tissue may be normal tissue, diseased tissue, or tissue suspected of being diseased. In some embodiments, the tissue may be sectioned tissue or whole intact tissue. In some embodiments, the tissue may be animal tissue or human tissue. Animal tissue includes, but is not limited to, tissues obtained from rodents (e.g., rats or mice), primates (e.g., monkeys), dogs, cats, and farm animals.

The biological sample may be from any source in the subject's body including, but not limited to, any fluid [such as blood (e.g., whole blood, blood serum, or blood plasma), saliva, tears, synovial fluid, cerebrospinal fluid, pleural fluid, pericardial fluid, ascitic fluid, and/or urine], hair, skin (including portions of the epidermis, dermis, and/or hypodermis), oropharynx, laryngopharynx, esophagus, stomach, bronchus, salivary gland, tongue, oral cavity, nasal cavity, vaginal cavity, anal cavity, bone, bone marrow, brain, thymus, spleen, small intestine, appendix, colon, rectum, anus, liver, biliary tract, pancreas, kidney, ureter, bladder, urethra, uterus, vagina, vulva, ovary, cervix, scrotum, penis, prostate, testicle, seminal vesicles, and/or any type of tissue (e.g., muscle tissue, epithelial tissue, connective tissue, or nervous tissue).

Any of the biological samples described herein may be obtained from the subject using any known technique. See, for example, the following publications on collecting, processing, and storing biological samples, each of which is incorporated herein by reference in its entirety: Biospecimens and biorepositories: from afterthought to science by Vaught et al. (Cancer Epidemiol Biomarkers Prev. 2012 February; 21(2):253-5), and Biological sample collection, processing, storage and information management by Vaught and Henderson (IARC Sci Publ. 2011; (163):23-42).

In some embodiments, the biological sample may be obtained from a surgical procedure (e.g., laparoscopic surgery, microscopically controlled surgery, or endoscopy), bone marrow biopsy, punch biopsy, endoscopic biopsy, or needle biopsy (e.g., a fine-needle aspiration, core needle biopsy, vacuum-assisted biopsy, or image-guided biopsy). In some embodiments, the biological sample may be obtained from an autopsy.

In some embodiments, one or more than one cell (i.e., a cell biological sample) may be obtained from a subject using a scrape or brush method. The cell biological sample may be obtained from any area in or from the body of a subject including, for example, from one or more of the following areas: the cervix, esophagus, stomach, bronchus, or oral cavity. In some embodiments, one or more than one piece of tissue (e.g., a tissue biopsy) from a subject may be used. In certain embodiments, the tissue biopsy may comprise one or more than one (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, or more than 10) biological samples from one or more tumors or tissues known or suspected of having cancerous cells.

Any of the biological samples from a subject described herein may be stored using any method that preserves stability of the biological sample. In some embodiments, preserving the stability of the biological sample means inhibiting components (e.g., DNA, RNA, protein, or tissue structure or morphology) of the biological sample from degrading until they are measured, so that, when measured, the measurements represent the state of the sample at the time of obtaining it from the subject. In some embodiments, a biological sample is stored in a composition that is able to penetrate the sample and protect components (e.g., DNA, RNA, protein, or tissue structure or morphology) of the biological sample from degrading. As used herein, degradation is the transformation of a component from one form to another such that the first form is no longer detected at the same level as before degradation.

In some embodiments, the biological sample is stored using cryopreservation. Non-limiting examples of cryopreservation include step-down freezing, blast freezing, direct plunge freezing, snap freezing, slow freezing using a programmable freezer, and vitrification. In some embodiments, the biological sample is stored using lyophilisation. In some embodiments, after the collection of the biological sample from the subject, the biological sample is placed into a container that already contains a preservant (e.g., RNALater to preserve RNA) and then frozen (e.g., by snap-freezing). In some embodiments, such storage in a frozen state is done immediately after collection of the biological sample. In some embodiments, a biological sample may be kept at either room temperature or 4° C. for some time (e.g., up to an hour, up to 8 h, up to 1 day, or a few days) in a preservant or in a buffer without a preservant, before being frozen.

Non-limiting examples of preservants include formalin solutions, formaldehyde solutions, RNALater or other equivalent solutions, TriZol or other equivalent solutions, DNA/RNA Shield or equivalent solutions, EDTA (e.g., Buffer AE (10 mM Tris·Cl; 0.5 mM EDTA, pH 9.0)) and other anticoagulants, and Acid Citrate Dextrose (e.g., for blood specimens). In some embodiments, special containers may be used for collecting and/or storing a biological sample. For example, a vacutainer may be used to store blood. In some embodiments, a vacutainer may comprise a preservant (e.g., a coagulant or an anticoagulant). In some embodiments, a container in which a biological sample is preserved may be contained in a secondary container, for the purpose of better preservation or for the purpose of avoiding contamination.

Any of the biological samples from a subject described herein may be stored under any condition that preserves stability of the biological sample. In some embodiments, the biological sample is stored at a temperature that preserves stability of the biological sample. In some embodiments, the sample is stored at room temperature (e.g., 25° C.). In some embodiments, the sample is stored under refrigeration (e.g., 4° C.). In some embodiments, the sample is stored under freezing conditions (e.g., −20° C.). In some embodiments, the sample is stored under ultralow temperature conditions (e.g., −50° C. to −80° C.). In some embodiments, the sample is stored under liquid nitrogen (e.g., −170° C.). In some embodiments, a biological sample is stored at −60° C. to −80° C. (e.g., −70° C.) for up to 5 years (e.g., up to 1 month, up to 2 months, up to 3 months, up to 4 months, up to 5 months, up to 6 months, up to 7 months, up to 8 months, up to 9 months, up to 10 months, up to 11 months, up to 1 year, up to 2 years, up to 3 years, up to 4 years, or up to 5 years). In some embodiments, a biological sample is stored as described by any of the methods described herein for up to 20 years (e.g., up to 5 years, up to 10 years, up to 15 years, or up to 20 years).

Methods of the present disclosure encompass obtaining one or more biological samples from a subject for analysis. In some embodiments, one biological sample is collected from a subject for analysis. In some embodiments, more than one (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more) biological samples are collected from a subject for analysis. In some embodiments, one biological sample from a subject will be analyzed. In some embodiments, more than one (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more) biological samples may be analyzed. If more than one biological sample from a subject is analyzed, the biological samples may be procured at the same time (e.g., more than one biological sample may be taken in the same procedure), or the biological samples may be taken at different times (e.g., during a different procedure including a procedure 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 days; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 weeks; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 months; 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 years; or 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 decades after a first procedure).

A second or subsequent biological sample may be taken or obtained from the same region (e.g., from the same tumor or area of tissue) or a different region (including, e.g., a different tumor). A second or subsequent biological sample may be taken or obtained from the subject after one or more treatments and may be taken from the same region or a different region. As a non-limiting example, the second or subsequent biological sample may be useful in determining whether the cancer in each biological sample has different characteristics (e.g., in the case of biological samples taken from two physically separate tumors in a patient) or whether the cancer has responded to one or more treatments (e.g., in the case of two or more biological samples from the same tumor or different tumors prior to and subsequent to a treatment). In some embodiments, each of the at least one biological sample is a bodily fluid sample, a cell sample, or a tissue biopsy sample.

In some embodiments, one or more biological specimens are combined (e.g., placed in the same container for preservation) before further processing. For example, a first sample of a first tumor obtained from a subject may be combined with a second sample of a second tumor from the subject, wherein the first and second tumors may or may not be the same tumor. In some embodiments, a first tumor and a second tumor are similar but not the same (e.g., two tumors in the brain of a subject). In some embodiments, a first biological sample and a second biological sample from a subject are samples of different types of tumors (e.g., a tumor in muscle tissue and a tumor in brain tissue).

In some embodiments, a sample from which RNA and/or DNA is extracted (e.g., a sample of tumor, or a blood sample) is sufficiently large such that at least 2 μg (e.g., at least 2 μg, at least 2.5 μg, at least 3 μg, at least 3.5 μg or more) of RNA can be extracted from it. In some embodiments, the sample from which RNA and/or DNA is extracted can be peripheral blood mononuclear cells (PBMCs). In some embodiments, the sample from which RNA and/or DNA is extracted can be any type of cell suspension. In some embodiments, a sample from which RNA and/or DNA is extracted (e.g., a sample of tumor, or a blood sample) is sufficiently large such that at least 1.8 μg RNA can be extracted from it. In some embodiments, at least 1 mg (e.g., at least 1 mg, at least 2 mg, at least 3 mg, at least 4 mg, at least 5 mg, at least 10 mg, at least 12 mg, at least 15 mg, at least 18 mg, at least 20 mg, at least 22 mg, at least 25 mg, at least 30 mg, at least 35 mg, at least 40 mg, at least 45 mg, or at least 50 mg) of tissue sample is collected from which RNA and/or DNA is extracted. In some embodiments, at least 20 mg of tissue sample is collected from which RNA and/or DNA is extracted. In some embodiments, at least 30 mg of tissue sample is collected. In some embodiments, at least 10-50 mg (e.g., 10-50 mg, 10-15 mg, 10-30 mg, 10-40 mg, 20-30 mg, 20-40 mg, 20-50 mg, or 30-50 mg) of tissue sample is collected from which RNA and/or DNA is extracted. In some embodiments, at least 20-30 mg of tissue sample is collected from which RNA and/or DNA is extracted. 
In some embodiments, a sample from which RNA and/or DNA is extracted (e.g., a sample of tumor, or a blood sample) is sufficiently large such that at least 0.2 μg (e.g., at least 200 ng, at least 300 ng, at least 400 ng, at least 500 ng, at least 600 ng, at least 700 ng, at least 800 ng, at least 900 ng, at least 1 μg, at least 1.1 μg, at least 1.2 μg, at least 1.3 μg, at least 1.4 μg, at least 1.5 μg, at least 1.6 μg, at least 1.7 μg, at least 1.8 μg, at least 1.9 μg, or at least 2 μg) of RNA can be extracted from it. In some embodiments, a sample from which RNA and/or DNA is extracted (e.g., a sample of tumor, or a blood sample) is sufficiently large such that at least 0.1 μg (e.g., at least 100 ng, at least 200 ng, at least 300 ng, at least 400 ng, at least 500 ng, at least 600 ng, at least 700 ng, at least 800 ng, at least 900 ng, at least 1 μg, at least 1.1 μg, at least 1.2 μg, at least 1.3 μg, at least 1.4 μg, at least 1.5 μg, at least 1.6 μg, at least 1.7 μg, at least 1.8 μg, at least 1.9 μg, or at least 2 μg) of RNA can be extracted from it.

Subjects

Aspects of this disclosure relate to a biological sample that has been obtained from a subject. In some embodiments, a subject is a mammal (e.g., a human, a mouse, a cat, a dog, a horse, a hamster, a cow, a pig, or other domesticated animal). In some embodiments, a subject is a human. In some embodiments, a subject is an adult human (e.g., of 18 years of age or older). In some embodiments, a subject is a child (e.g., less than 18 years of age). In some embodiments, a human subject is one who has or has been diagnosed with at least one form of cancer. In some embodiments, a cancer from which a subject suffers is a carcinoma, a sarcoma, a myeloma, a leukemia, a lymphoma, or a mixed type of cancer that comprises more than one of a carcinoma, a sarcoma, a myeloma, a leukemia, and a lymphoma. Carcinoma refers to a malignant neoplasm of epithelial origin or cancer of the internal or external lining of the body. Sarcoma refers to cancer that originates in supportive and connective tissues such as bones, tendons, cartilage, muscle, and fat. Myeloma is cancer that originates in the plasma cells of bone marrow. Leukemias (“liquid cancers” or “blood cancers”) are cancers of the bone marrow (the site of blood cell production). Lymphomas develop in the glands or nodes of the lymphatic system, a network of vessels, nodes, and organs (specifically the spleen, tonsils, and thymus) that purify bodily fluids and produce infection-fighting white blood cells, or lymphocytes. Non-limiting examples of a mixed type of cancer include adenosquamous carcinoma, mixed mesodermal tumor, carcinosarcoma, and teratocarcinoma. In some embodiments, a subject has a tumor. A tumor may be benign or malignant. 
In some embodiments, a cancer is any one of the following: skin cancer, lung cancer, breast cancer, prostate cancer, colon cancer, rectal cancer, cervical cancer, and cancer of the uterus. In some embodiments, a subject is at risk for developing cancer, e.g., because the subject has one or more genetic risk factors, or has been exposed to or is being exposed to one or more carcinogens (e.g., cigarette smoke, or chewing tobacco).

Single Cell Suspensions

In some embodiments, methods (e.g., RNA sequencing, DNA sequencing, or multiplexed flow cytometry) to characterize a cancer that a subject has or is suspected of having are performed at a single-cell level to capture the heterogeneity of a single tumor or cancerous tissue, or multiple tumors or cancerous tissues. That is, measurements and assessment of single cells in a tumor sample provide information that is not confounded by genotypic or phenotypic heterogeneity of bulk samples. In some embodiments, a single-cell suspension is prepared from one or more biological samples obtained from a subject for use in methods such as single-cell RNA or DNA sequencing, or mass cytometry.

Accordingly, some embodiments of any one of the methods described herein comprise forming a single-cell suspension of cells from a sample of tumor (e.g., a first sample of tumor). In some embodiments, forming a single-cell suspension of cells from a sample of tumor comprises dissecting a tumor sample to obtain tumor sample fragments. A curved scissor may be used to dissect a tumor tissue sample. In some embodiments, a tumor sample fragment is 0.5-3 mm³ (e.g., 1-2 mm³). In some embodiments, a tumor tissue sample or fragment thereof is kept moist while dissecting.

A method of preparing a single-cell suspension from a tumor sample may comprise any one or more of the following steps in any order: fine mincing, enzymatic and/or non-enzymatic digestion, vigorous pipetting, passage through a cell-strainer, washing, and counting. In some embodiments, one or more of these steps are repeated (e.g., 1 time, 2 times, 3 times, 4 times, or 5 or more times).

In some embodiments, a tumor sample or tumor sample fragment(s) is incubated in an enzyme cocktail. Any number and combination of enzymes can be used; see, e.g., BioFiles: For Life Science Research, Issue 2, 2006 (www.sigmaaldrich.com/content/dam/sigma-aldrich/docs/Sigma/General_Information/2/biofiles_issue2.pdf), which is incorporated herein by reference in its entirety, and especially to incorporate herein any enzyme or other component (e.g., media) listed therein.

Quatromoni et al. (An optimized disaggregation method for human lung tumors that preserves the phenotype and function of the immune cells; J Leukoc Biol. 2015 January; 97(1): 201-209) provides a comparison of different enzymatic cocktails and is incorporated herein by reference in its entirety. In some embodiments, an enzyme cocktail comprises any one or more of the following components: media (e.g., L-15 media), anti-bacterials (e.g., penicillin and/or streptomycin), anti-fungals (e.g., amphotericin), collagenase (e.g., collagenase I, collagenase II, collagenase IV), DNAse (e.g., DNAse I), elastase, hyaluronidase, and proteases (e.g., protease XIV, trypsin, papain, or thermolysin). Coll I has the original balance of collagenase, caseinase, clostripain, and tryptic activities; Coll II contains higher relative levels of protease activity, particularly clostripain; and Coll IV is designed to be especially low in tryptic activity (Quatromoni et al., J Leukoc Biol. 2015 January; 97(1): 201-209). In some embodiments, only collagenase I, collagenase II, or collagenase IV is used. In some embodiments, a mixture of two collagenases is used (e.g., collagenase I and collagenase II, collagenase I and collagenase IV, or collagenase II and collagenase IV). In some embodiments, more than two collagenases are used (e.g., collagenase I, collagenase II, and collagenase IV).

In some embodiments, an enzyme cocktail comprises one or more of the following components: media (e.g., complete media), penicillin, streptomycin, collagenase (e.g., collagenase I or collagenase IV). Concentrations of enzymes in a cocktail can be adjusted. A non-limiting example of an enzyme cocktail is as follows: collagenase I (0.2 mg/ml), collagenase IV (1 mg/ml), complete medium, penicillin (0.001%), and DNAse.
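
As a minimal illustrative sketch (not part of the disclosed methods), the amount of each enzyme needed for a given cocktail volume follows directly from the example concentrations above. The function name and dictionary below are hypothetical, and only the two components with a stated mg/ml concentration (collagenase I and collagenase IV) are included; penicillin and DNAse are omitted because the text gives no mass concentration for them.

```python
# Example concentrations from the non-limiting cocktail above, in mg/ml.
# The dictionary and function names are hypothetical, for illustration only.
EXAMPLE_COCKTAIL_MG_PER_ML = {
    "collagenase I": 0.2,
    "collagenase IV": 1.0,
}


def enzyme_masses_mg(volume_ml, concentrations=EXAMPLE_COCKTAIL_MG_PER_ML):
    """Return the mass (mg) of each enzyme needed for volume_ml of cocktail."""
    if volume_ml <= 0:
        raise ValueError("volume_ml must be positive")
    # mass (mg) = concentration (mg/ml) * volume (ml)
    return {name: conc * volume_ml for name, conc in concentrations.items()}


# Masses needed to prepare 25 ml of the example cocktail.
masses = enzyme_masses_mg(25.0)
for name, mass in masses.items():
    print(f"{name}: {mass:.2f} mg")
```

Because concentrations of enzymes in a cocktail can be adjusted, a different dictionary of concentrations may be passed as the second argument.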

In some embodiments, at least 25 ml (e.g., at least 25 ml, at least 26 ml, at least 27 ml, at least 28 ml, at least 29 ml, or at least 30 ml) of enzyme cocktail is added per 0.5 g of tumor tissue. In some embodiments, a sample of tumor or fragments thereof is incubated in enzyme cocktail while the sample is being shaken or agitated (e.g., tumbled at 85 RPM, and/or vigorous pipetting). In some embodiments, a sample of tumor or fragments thereof is incubated in enzyme cocktail at a temperature between 20-50° C. (e.g., 20-50° C., 20-25° C., 25-30° C., 25-35° C., 30-40° C., 35-45° C., 40-50° C., or 30-50° C.). In some embodiments, a method of preparing a single-cell suspension comprises filtering the enzyme cocktail, e.g., through a cell strainer (e.g., 50 μm, 70 μm, or 100 μm). In some embodiments, too fine a filter may result in a cell composition having a high concentration of fibroblast cells. In some embodiments, too coarse a filter may result in clumps of cells. In some embodiments, clumps of cells are disaggregated using mechanical force (e.g., vigorous pipetting, or applying pressure using a syringe).
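
The volume guideline above (at least 25 ml of enzyme cocktail per 0.5 g of tumor tissue) can be sketched as a simple calculation. The sketch below assumes the minimum volume scales linearly with tissue mass, which the text does not state explicitly; the function and constant names are hypothetical.

```python
# Minimum cocktail volume per 0.5 g of tumor tissue, taken from the text.
MIN_ML_PER_HALF_GRAM = 25.0


def min_cocktail_volume_ml(tissue_mass_g, ml_per_half_gram=MIN_ML_PER_HALF_GRAM):
    """Return the minimum enzyme cocktail volume (ml) for a tissue mass (g).

    Assumes linear scaling with tissue mass (an assumption made for
    illustration, not a statement of the disclosed method).
    """
    if tissue_mass_g <= 0:
        raise ValueError("tissue_mass_g must be positive")
    return ml_per_half_gram * (tissue_mass_g / 0.5)


# Minimum volumes for a few hypothetical tissue masses.
for mass_g in (0.5, 1.0, 1.2):
    print(f"{mass_g} g tissue -> at least {min_cocktail_volume_ml(mass_g):.1f} ml")
```

Passing a larger per-0.5 g value (e.g., 30.0) as the second argument corresponds to the higher volumes listed in the parenthetical above.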

In some embodiments, filtered cells are lysed using RBC lysis buffer to lyse red blood cells. RBC lysis buffers are available commercially (see e.g., www.abcam.com/red-blood-cell-rbc-lysis-buffer-ab204733.html).

In some embodiments, a method of preparing a single-cell suspension comprises enzymatic and mechanical dissociation. Examples of methods of dissociating cells from tissue can be found in the following publications: Quatromoni et al., An optimized disaggregation method for human lung tumors that preserves the phenotype and function of the immune cells; J Leukoc Biol. 2015 January; 97(1): 201-209, Pennartz et al., Generation of Single-Cell Suspensions from Mouse Neural Tissue; JOVE Issue 29; doi: 10.3791/1267; Published: Jul. 7, 2009, and www.youtube.com/watch?v=N0jftyYqM38.

In some embodiments, cell-dissociation buffers that do not contain enzymes are used. See e.g., ThermoFisher Scientific catalog numbers 13151014 and 13150016, or Millipore Sigma Aldrich catalog number S-014-B. Heng et al. (Biol Proced Online. 2009; 11: 161-169) provides a comparison of enzymatic and non-enzymatic means of dissociating cells and is incorporated herein by reference in its entirety.

In some embodiments, the number of cells in a single-cell suspension is counted and their viability tested. The Examples below provide an example of an overall process of forming a single-cell suspension from a sample of tumor tissue.

In some embodiments, a method comprises forming a single-cell suspension of cells from a sample of tumor and partitioning it into at least a first and second part. A first and second part of a single-cell suspension may be of equal size or of different sizes (e.g., comprising a different number of cells). In some embodiments, all the parts of a single-cell suspension (e.g., a first part, a second part, and so on) are stored in separate containers under the same or similar conditions (e.g., in liquid nitrogen, or −80° C.). In some embodiments, the different parts of a single-cell suspension are stored under different conditions, either before or after any further processing (e.g., labeling with antibodies for protein expression studies). In some embodiments, cells isolated from a biological sample are cultured and expanded and then stored. In some embodiments, cells isolated from a biological sample are cultured and expanded after storage.

In some embodiments, any one of the methods described herein further comprises forming a lysate from at least a part (e.g., a first or second part) of the single-cell suspension. In some embodiments, different parts of a single-cell suspension comprise different types of cells. In some embodiments, a part of the single-cell suspension from which a lysate is formed comprises at least 1×10⁶ cells (e.g., at least 1×10⁶ cells, at least 2×10⁶ cells, at least 3×10⁶ cells, at least 4×10⁶ cells, or at least 5×10⁶ cells). In some embodiments, a part of the single-cell suspension from which a lysate is formed comprises at least 2×10⁶ cells. Lysate may be stored in storage mediums that will prevent the degradation of DNA and/or RNA (e.g., RNALater). In some embodiments, a method comprises extracting RNA from the lysate from a single-cell suspension or each part of a single-cell suspension and performing RNA sequencing on the extracted RNA to obtain RNA expression data. These RNA expression data can be used to determine the heterogeneity of a tumor.

An overview of single-cell RNA sequencing can be found at hemberg-lab.github.io/scRNA.seq.course/introduction-to-single-cell-rna-seq.html, FIG. 2.1 of which is incorporated by reference herein. In some embodiments, a method of performing RNA sequencing on a single-cell suspension comprises single-cell RNA isolation, reverse transcription, cDNA pre-amplification, cDNA library preparation (e.g., using Fluidigm C1 Protocol), and sequencing using platforms such as Illumina HiSeq 2500.

Methods of performing single-cell RNA sequencing are described by the following references, each of which is incorporated herein by reference in its entirety: Bagnoli et al. (Studying Cancer Heterogeneity by Single-Cell RNA Sequencing; Methods Mol Biol. 2019; 1956:305-319); Sun et al. (Single-cell RNA sequencing reveals gene expression signatures of breast cancer-associated endothelial cells; Oncotarget. 2018 Feb. 16; 9(13): 10945-10961); Kulkarni et al. (Beyond bulk: a review of single cell transcriptomics methodologies and applications; Curr Opin Biotechnol. 2019 Apr. 9; 58:129-136); Huang et al. (High Throughput Single Cell RNA Sequencing, Bioinformatics Analysis and Applications; Adv Exp Med Biol. 2018; 1068:33-43); Zilionis et al. (Single-Cell Transcriptomics of Human and Mouse Lung Cancers Reveals Conserved Myeloid Populations across Individuals and Species; Immunity. 2019 Apr. 5. pii: S1074-7613(19)30126-8); Kashima et al. (An Informative Approach to Single-Cell Sequencing Analysis; Adv Exp Med Biol. 2019; 1129:81-96. doi: 10.1007/978-981-13-6037-4_6); Seki et al. (Single-Cell DNA-Seq and RNA-Seq in Cancer Using the C1 System; Adv Exp Med Biol. 2019; 1129:27-50. doi: 10.1007/978-981-13-6037-4_3); and See et al. (A Single-Cell Sequencing Guide for Immunologists; Front Immunol. 2018; 9: 2425).

Gan et al. (Identification of cancer subtypes from single-cell RNA-seq data using a consensus clustering method; BMC Med Genomics. 2018; 11(Suppl 6): 117) describes a clustering method for single-cell RNA sequencing data, which is incorporated herein in its entirety by reference.

In some embodiments, any one of the following single-cell RNA sequencing methods is used: Fluidigm C1 system (SMART-seq), Fluidigm C1 system (mRNA Seq HT), SMART-seq2, 10× Genomics Chromium system, and MARS-seq. See et al. (Front Immunol. 2018; 9: 2425) provides a comparison of these methods and is incorporated herein in its entirety by reference.

In some embodiments, any one of the methods described herein further comprises performing measurement of a single-cell suspension. In some embodiments, different measurements are made in parallel on the same cells. Macaulay et al. (Trends Genet. 2017 February; 33(2): 155-168) describes methods of making multiple measurements from single cells and is incorporated herein by reference in its entirety.

In some embodiments, any one of the methods described herein further comprises performing mass cytometry on at least a first part of the single-cell suspension. Mass cytometry is a mass spectrometry technique based on inductively coupled plasma mass spectrometry and time of flight mass spectrometry used for the determination of the properties of cells. In some embodiments, mass cytometry comprises conjugating antibodies with isotopically pure elements, and then using them to label cellular molecules (e.g., proteins). In some embodiments, cells are nebulized and sent through an argon plasma, which ionizes the metal-conjugated antibodies. The metal signals are then analyzed by a time-of-flight mass spectrometer to identify and quantify the cellular molecules in the cells. In some embodiments, a single-cell suspension or part thereof on which mass cytometry is performed comprises at least 1×10⁶ cells (e.g., at least 1×10⁶ cells, at least 2×10⁶ cells, at least 3×10⁶ cells, at least 4×10⁶ cells, at least 5×10⁶ cells, at least 6×10⁶ cells, at least 7×10⁶ cells, at least 8×10⁶ cells, at least 9×10⁶ cells, or at least 10×10⁶ cells). In some embodiments, a single-cell suspension or part thereof on which mass cytometry is performed comprises at least 5×10⁶ cells.
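
As an illustrative sketch only, the minimum cell counts described herein for a part of a single-cell suspension used to form a lysate (at least 1×10⁶ cells) and for a part used for mass cytometry (at least 1×10⁶ cells) can be checked as follows. The part names, use labels, function name, and cell counts in the example are hypothetical.

```python
# Minimum cell counts per intended use, taken from the embodiments in the text.
# The keys ("lysate", "mass_cytometry") are hypothetical labels.
MIN_CELLS = {
    "lysate": 1_000_000,
    "mass_cytometry": 1_000_000,
}


def parts_meeting_minimum(parts, minimums=MIN_CELLS):
    """Map each part name to True if its cell count meets the minimum for its use.

    parts: dict of part name -> (intended use, counted number of cells).
    """
    return {
        name: count >= minimums[use]
        for name, (use, count) in parts.items()
    }


# Hypothetical partition of one single-cell suspension into two parts.
example_parts = {
    "first part": ("mass_cytometry", 800_000),
    "second part": ("lysate", 2_000_000),
}
print(parts_meeting_minimum(example_parts))
```

Stricter embodiments (e.g., at least 5×10⁶ cells for mass cytometry, or at least 2×10⁶ cells for a lysate) can be represented by passing a different dictionary of minimums.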

Methods of performing mass cytometry are described by the following references, each of which is incorporated herein by reference in its entirety: Galli et al. (The end of omics? High dimensional single cell analysis in precision medicine; Eur J Immunol. 2019 February; 49(2):212-220); Brodin (The biology of the cell—insights from mass cytometry; FEBS J. 2018 Nov. 3. doi: 10.1111/febs.14693); Olsen et al. (The anatomy of single cell mass cytometry data; Cytometry A. 2019 February; 95(2):156-172); Behbehani (Applications of Mass Cytometry in Clinical Medicine: The Promise and Perils of Clinical CyTOF; Clin Lab Med. 2017 December; 37(4):945-964); Gondhalekar et al. (Alternatives to current flow cytometry data analysis for clinical and research studies; Methods. 2018 Feb. 1; 134-135:113-129); and Soares et al. (Go with the flow: advances and trends in magnetic flow cytometry; Anal Bioanal Chem. 2019 March; 411(9):1839-1862. doi: 10.1007/s00216-019-01593-9. Epub 2019 Feb. 19).

Other Assays

Any of the biological samples described herein can be used for obtaining expression data using conventional assays or those described herein. Expression data, in some embodiments, includes gene expression levels. Gene expression levels may be detected by detecting a product of gene expression such as mRNA and/or protein.

In some embodiments, gene expression levels are determined by detecting a level of a protein in a sample and/or by detecting a level of activity of a protein in a sample. As used herein, the terms “determining” or “detecting” may include assessing the presence, absence, quantity and/or amount (which can be an effective amount) of a substance within a sample, including the derivation of qualitative or quantitative concentration levels of such substances, or otherwise evaluating the values and/or categorization of such substances in a sample from a subject.

The level of a protein may be measured using an immunoassay. Examples of immunoassays include any known assay (without limitation), and may include any of the following: immunoblotting assay (e.g., Western blot), immunohistochemical analysis, flow cytometry assay, immunofluorescence assay (IF), enzyme linked immunosorbent assays (ELISAs) (e.g., sandwich ELISAs), radioimmunoassays, electrochemiluminescence-based detection assays, magnetic immunoassays, lateral flow assays, and related techniques. Additional suitable immunoassays for detecting a level of a protein provided herein will be apparent to those of skill in the art.

Such immunoassays may involve the use of an agent (e.g., an antibody) specific to the target protein. An agent such as an antibody that “specifically binds” to a target protein is a term well understood in the art, and methods to determine such specific binding are also well known in the art. An antibody is said to exhibit “specific binding” if it reacts or associates more frequently, more rapidly, with greater duration and/or with greater affinity with a particular target protein than it does with alternative proteins. It is also understood by reading this definition that, for example, an antibody that specifically binds to a first target peptide may or may not specifically or preferentially bind to a second target peptide. As such, “specific binding” or “preferential binding” does not necessarily require (although it can include) exclusive binding. Generally, but not necessarily, reference to binding means preferential binding. In some examples, an antibody that “specifically binds” to a target peptide or an epitope thereof may not bind to other peptides or other epitopes in the same antigen. In some embodiments, a sample may be contacted, simultaneously or sequentially, with more than one binding agent that binds different proteins (e.g., multiplexed analysis).

As used herein, the term “antibody” refers to a protein that includes at least one immunoglobulin variable domain or immunoglobulin variable domain sequence. For example, an antibody can include a heavy (H) chain variable region (abbreviated herein as VH), and a light (L) chain variable region (abbreviated herein as VL). In another example, an antibody includes two heavy (H) chain variable regions and two light (L) chain variable regions. The term “antibody” encompasses antigen-binding fragments of antibodies (e.g., single chain antibodies, Fab and sFab fragments, F(ab′)2, Fd fragments, Fv fragments, scFv, and domain antibodies (dAb) fragments (de Wildt et al., Eur J Immunol. 1996; 26(3):629-39.)) as well as complete antibodies. An antibody can have the structural features of IgA, IgG, IgE, IgD, IgM (as well as subtypes thereof). Antibodies may be from any source including, but not limited to, primate (human and non-human primate) and primatized (such as humanized) antibodies.

In some embodiments, the antibodies as described herein can be conjugated to a detectable label and the binding of the detection reagent to the peptide of interest can be determined based on the intensity of the signal released from the detectable label. Alternatively, a secondary antibody specific to the detection reagent can be used. One or more antibodies may be coupled to a detectable label. Any suitable label known in the art can be used in the assay methods described herein. In some embodiments, a detectable label comprises a fluorophore. As used herein, the term “fluorophore” (also referred to as “fluorescent label” or “fluorescent dye”) refers to moieties that absorb light energy at a defined excitation wavelength and emit light energy at a different wavelength. In some embodiments, a detection moiety is or comprises an enzyme. In some embodiments, an enzyme is one (e.g., β-galactosidase) that produces a colored product from a colorless substrate.

It will be apparent to those of skill in the art that this disclosure is not limited to immunoassays. Detection assays that are not based on an antibody, such as mass spectrometry, are also useful for the detection and/or quantification of a protein and/or a level of protein as provided herein. Assays that rely on a chromogenic substrate can also be useful for the detection and/or quantification of a protein and/or a level of protein as provided herein.

Alternatively, the level of nucleic acids encoding a gene in a sample can be measured via a conventional method. In some embodiments, measuring the expression level of nucleic acid encoding the gene comprises measuring mRNA. In some embodiments, the expression level of mRNA encoding a gene can be measured using real-time reverse transcriptase (RT) Q-PCR or a nucleic acid microarray. Methods to detect nucleic acid sequences include, but are not limited to, polymerase chain reaction (PCR), reverse transcriptase-PCR (RT-PCR), in situ PCR, quantitative PCR (Q-PCR), real-time quantitative PCR (RT Q-PCR), in situ hybridization, Southern blot, Northern blot, sequence analysis, microarray analysis, detection of a reporter gene, or other DNA/RNA hybridization platforms.

In some embodiments, the level of nucleic acids encoding a gene in a sample can be measured via a hybridization assay. In some embodiments, the hybridization assay comprises at least one binding partner. In some embodiments, the hybridization assay comprises at least one oligonucleotide binding partner. In some embodiments, the hybridization assay comprises at least one labeled oligonucleotide binding partner. In some embodiments, the hybridization assay comprises at least one pair of oligonucleotide binding partners. In some embodiments, the hybridization assay comprises at least one pair of labeled oligonucleotide binding partners.

Any binding agent that specifically binds to a desired nucleic acid or protein may be used in the methods and kits described herein to measure an expression level in a sample. In some embodiments, the binding agent is an antibody or an aptamer that specifically binds to a desired protein. In other embodiments, the binding agent may be one or more oligonucleotides complementary to a nucleic acid or a portion thereof. In some embodiments, a sample may be contacted, simultaneously or sequentially, with more than one binding agent that binds different proteins or different nucleic acids (e.g., multiplexed analysis).

To measure an expression level of a protein or nucleic acid, a sample can be in contact with a binding agent under suitable conditions. In general, the term “contact” refers to an exposure of the binding agent to the sample or cells collected therefrom for a suitable period sufficient for the formation of complexes between the binding agent and the target protein or target nucleic acid in the sample, if any. In some embodiments, the contacting is performed by capillary action in which a sample is moved across a surface of the support membrane.

In some embodiments, an assay may be performed in a low-throughput platform, including single assay format. In some embodiments, an assay may be performed in a high-throughput platform. Such high-throughput assays may comprise using a binding agent immobilized to a solid support (e.g., one or more chips). Methods for immobilizing a binding agent will depend on factors such as the nature of the binding agent and the material of the solid support and may require particular buffers. Such methods will be evident to one of ordinary skill in the art.

Extraction of DNA and/or RNA

In some embodiments of any one of the methods described herein, RNA is extracted from a biological sample to prevent it from being degraded and/or to prevent the inhibition of enzymes in downstream processing, e.g., the preparation of DNA (i.e., a cDNA library from RNA). In some embodiments of any one of the methods described herein, DNA is extracted from a biological sample to prevent it from being degraded and/or to prevent the inhibition of enzymes in downstream processing, e.g., the preparation of DNA. In some embodiments, the term “extraction” in the context of obtaining DNA or RNA from a biological sample is used interchangeably with the term “isolation.”

Methods described herein involve extraction of RNA and/or DNA from a biological sample (e.g., a tumor sample or sample of blood). As described above, a biological sample may be comprised of more than one sample from one or more than one tissues (e.g., one or more than one different tumors). In some embodiments, RNA and/or DNA are extracted from a combined sample. In some embodiments, RNA and/or DNA is extracted from multiple biological samples from a subject, and then combined before further processing (e.g., storage, or DNA library preparation). In some embodiments, more than one sample of extracted RNA and/or DNA are combined with each other after retrieval from storage. In some embodiments, at least tumor DNA is extracted from one or more tumor tissues. In some embodiments, at least tumor RNA is extracted from one or more tumor tissues. In some embodiments, at least normal DNA is extracted from one or more normal tissues to serve as a control. In some embodiments, at least normal RNA is extracted from one or more normal tissues to serve as a control. Protocols of DNA/RNA extraction can be found at least in Example 2.

Methods for extracting DNA and/or RNA from biological samples are known in the art, and reagents and kits for doing so are commercially available. Gómez-Acata et al. (Methods for extracting ‘omes from microbialites, J Microbiol Methods. 2019 Mar. 12; 160:1-10) describes methods applied for DNA and RNA extraction from microbialites, along with their advantages and disadvantages, and is incorporated herein by reference in its entirety. The methods described in Gómez-Acata et al. are generally applicable for RNA and/or DNA extracted from tissue. Moore (Curr Protoc Immunol. 2001 May; Chapter 10:Unit 10.1) describes purification and concentration of DNA from aqueous solutions and is also incorporated by reference herein in its entirety.

In some embodiments, extracting DNA and/or RNA comprises lysing cells of a biological sample and isolating DNA and/or RNA from other cellular components. Examples of methods for lysing cells include, but are not limited to, mechanical lysis, liquid homogenization, sonication, freeze-thaw, chemical lysis, alkaline lysis, and manual grinding.

Methods for extracting DNA and/or RNA include, but are not limited to, solution phase extraction methods and solid-phase extraction methods. In some embodiments, a solution phase extraction method comprises an organic extraction method, e.g., a phenol chloroform extraction method. In some embodiments, a solution phase extraction method comprises a high salt concentration extraction method, e.g., guanidinium thiocyanate (GuTC) or guanidinium chloride (GuCl) extraction method. In some embodiments, a solution phase extraction method comprises an ethanol precipitation method. In some embodiments, a solution phase extraction method comprises an isopropanol precipitation method. In some embodiments, a solution phase extraction method comprises an ethidium bromide (EtBr)-Cesium Chloride (CsCl) gradient centrifugation method. In some embodiments, extracting DNA and/or RNA comprises a nonionic detergent extraction method, e.g., a cetyltrimethylammonium bromide (CTAB) extraction method.

In some embodiments, extracting DNA and/or RNA comprises a solid phase extraction method. Any solid phase that binds to DNA and/or RNA may be used for extracting DNA and/or RNA in methods and systems described herein. Examples of solid phases that bind DNA and/or RNA include, but are not limited to, silica matrices, ion exchange matrices, glass particles, magnetizable cellulose beads, polyamide matrices, and nitrocellulose membranes.

In some embodiments, a solid phase extraction method comprises a spin-column based extraction method. In some embodiments, a solid phase extraction method comprises a bead-based extraction method. In some embodiments, a solid phase extraction method comprises a cation exchange resin, e.g., a styrene divinylbenzene copolymer resin.

Systems and methods described herein encompass extracting DNA and/or RNA from a single biological sample or a plurality of biological samples. In some embodiments, extracting DNA comprises extracting DNA from a single sample. In some embodiments, extracting DNA comprises extracting DNA from a plurality of samples. In some embodiments, extracting DNA comprises extracting DNA from a first sample and a second sample. In some embodiments, extracting DNA comprises extracting DNA from one or more, two or more, three or more, four or more, five or more, six or more, seven or more, eight or more, nine or more, or ten or more samples.

In some embodiments, extracting RNA comprises extracting RNA from a single sample. In some embodiments, extracting RNA comprises extracting RNA from a plurality of samples. In some embodiments, extracting RNA comprises extracting RNA from a first sample and a second sample. In some embodiments, extracting RNA comprises extracting RNA from one or more, two or more, three or more, four or more, five or more, six or more, seven or more, eight or more, nine or more, or ten or more samples.

Extracted DNA and/or RNA from a biological sample may be combined with extracted DNA and/or RNA from another biological sample. This may be accomplished by combining one or more biological samples and extracting nucleic acids or by combining nucleic acids extracted from one or more biological samples. In some embodiments, a first biological sample is combined with a second biological sample to form a combined sample, and DNA and/or RNA is extracted from the combined sample. In some embodiments, extracted DNA and/or RNA from a first biological sample may be combined with extracted DNA and/or RNA from a second biological sample.

Systems and methods described herein encompass extracting any type of DNA and/or RNA from a biological sample. In some embodiments, extracting DNA comprises extracting genomic DNA (gDNA). In some embodiments, extracting DNA comprises extracting mitochondrial DNA. In some embodiments, extracting RNA comprises extracting messenger RNA (mRNA). In some embodiments, extracting RNA comprises extracting precursor mRNA (pre-mRNA). In some embodiments, extracting RNA comprises extracting ribosomal RNA (rRNA). In some embodiments, extracting RNA comprises extracting transfer RNA (tRNA).

In some embodiments, a single kit is used to purify DNA and RNA from the same sample. A non-limiting example of a kit for doing so is the Qiagen AllPrep DNA/RNA kit. In some embodiments, robotics is employed to carry out DNA and/or RNA extraction.

In some embodiments, if a sample of extracted RNA is not of sufficient yield and/or quality, any one of the following outcomes may occur. First, there may be an overrepresentation of common transcripts in RNA sequencing data, and an underrepresentation of low abundance transcripts. Second, poor quality RNA can lead to insufficient read lengths (i.e., reads are shorter) and/or inadequate read quality, leading to potential misidentification of RNA.

For whole exome sequencing, poor quantity and quality of DNA can lead to misidentification of base pairs, leading to false variant discovery (e.g., a false positive) or instances where variants are not identified (e.g., a false negative). Another problem that can result from low DNA quantity and/or quality is inadequate coverage of the exome (e.g., missing sequences).

In some embodiments, before extracted RNA and/or DNA is processed further for RNA sequencing or whole exome sequencing (WES), the quality and/or quantity of RNA or DNA is checked. In some embodiments, a sample of extracted RNA is at least 1000-6000 ng in total mass. In some embodiments, a sample of extracted RNA is at least 100-60000 ng (e.g., 100-60000 ng, 500-30000 ng, 800-20000 ng, 1000-15000 ng, 1000-10000 ng, 1000-8000 ng, 1000-6000 ng, 10000-20000 ng, 20000-60000 ng) in total mass. In some embodiments, the acceptable total RNA amount for further sequencing is at least 100-1,000 ng (e.g., 100-1,000 ng, 500-1,000 ng, or 300-900 ng). In some embodiments, the target total RNA amount for further sequencing is more than 200-1,000 ng (e.g., 200-1,000 ng, 500-1,000 ng, or 300-1,000 ng). In some embodiments, the purity of a sample of extracted RNA is such that it corresponds to a ratio of absorbance at 260 nm to absorbance at 280 nm of at least 1 (e.g., at least 1, at least 1.2, at least 1.4, at least 1.6, at least 1.8, or at least 2). In some embodiments, the purity of a sample of extracted RNA is such that it corresponds to a ratio of absorbance at 260 nm to absorbance at 280 nm of at least 2. The ratio of absorbance at 260 nm to absorbance at 280 nm is used to assess the purity of DNA and RNA. A ratio of ~1.8 is generally accepted as “pure” for DNA; a ratio of ~2.0 is generally accepted as “pure” for RNA. If the ratio is appreciably lower in either case, it may indicate the presence of protein, phenol or other contaminants that absorb strongly at or near 280 nm. Absorbances can be measured using a spectrophotometer.
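The absorbance-ratio check described above can be illustrated with a minimal Python sketch. The function names and the example readings are hypothetical and are not part of any instrument software; the minimum ratio is passed in by the caller because different embodiments use different thresholds.

```python
def absorbance_ratio(a_numerator: float, a_denominator: float) -> float:
    """Return the ratio of two absorbance readings, e.g. A260 / A280."""
    if a_denominator <= 0:
        raise ValueError("denominator absorbance must be positive")
    return a_numerator / a_denominator


def meets_min_ratio(a260: float, a280: float, minimum: float) -> bool:
    """True if the A260/A280 ratio is at least the chosen minimum."""
    return absorbance_ratio(a260, a280) >= minimum


# Hypothetical spectrophotometer readings for one RNA sample.
a260, a280 = 1.00, 0.50
ratio = absorbance_ratio(a260, a280)
passes_rna_check = meets_min_ratio(a260, a280, minimum=2.0)
```

The same helper could be applied to a 260 nm/230 nm ratio by passing the 230 nm reading as the denominator.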

In some embodiments, the purity or integrity of RNA or DNA (e.g., a DNA fragment library) extracted by any one of the methods described herein is such that it corresponds to an RNA integrity number (RIN) of at least 4 (e.g., at least 4, at least 5, at least 6, at least 7, at least 8, or at least 9). In some embodiments, the purity of nucleic acid (e.g., RNA or DNA) extracted by any one of the methods described herein is such that it corresponds to an RNA integrity number (RIN) of at least 7. RIN has been demonstrated to be robust and reproducible in studies comparing it to other RNA integrity calculation algorithms, cementing its position as a preferred method of determining the quality of RNA to be analyzed (Imbeaud et al., Towards standardization of RNA quality assessment using user-independent classifiers of microcapillary electrophoresis traces; Nucleic Acids Research. 33 (6): e56).

In some embodiments, a sample of extracted DNA is at least 100-20000 ng (e.g., 100-20000 ng, 500-15000 ng, 800-10000 ng, 1000-15000 ng, 1000-10000 ng, 1000-8000 ng, 1000-6000 ng, or 1000-2000 ng) in total mass. In some embodiments, a sample of extracted DNA is at least 1000-2000 ng in total mass. In some embodiments, the acceptable total DNA amount for further sequencing is at least 20-200 ng (e.g., 20-200 ng, 30-200 ng, or 50-150 ng). In some embodiments, the target total DNA amount for further sequencing is more than 30-200 ng (e.g., 30-200 ng, 50-200 ng, or 100-200 ng). In some embodiments, the target purity of a sample of extracted DNA is such that it corresponds to a range of a ratio of absorbance at 260 nm to absorbance at 280 nm of at least 1.8-2 (e.g., at least 1.8-2, at least 1.8-1.9). In some embodiments, the purity of a sample of extracted DNA is such that it corresponds to a ratio of absorbance at 260 nm to absorbance at 280 nm of at least 1 (e.g., at least 1, at least 1.2, at least 1.4, at least 1.6, at least 1.8, or at least 2). In some embodiments, the acceptable purity of a sample of extracted DNA is such that it corresponds to a ratio of absorbance at 260 nm to absorbance at 280 nm of at least 1.5 (e.g., at least 1.5, at least 1.7, at least 2). In some embodiments, the target purity of a sample of extracted DNA is such that it corresponds to a range of a ratio of absorbance at 260 nm to absorbance at 230 nm of at least 2-2.2 (e.g., at least 2-2.2, at least 2-2.1). In some embodiments, the acceptable purity of a sample of extracted DNA is such that it corresponds to a ratio of absorbance at 260 nm to absorbance at 230 nm of at least 1.5 (e.g., at least 1.5, at least 1.7, at least 2). In some embodiments, the purity of a sample of extracted DNA as described herein is analyzed by a spectrophotometer, for example a small volume full-spectrum, UV-visible spectrophotometer (e.g., Nanodrop spectrophotometer available from ThermoFisher Scientific, www.thermofisher.com).

In some embodiments, a sample of extracted DNA has a target concentration of at least 4.5 ng/μl (e.g., 4.5 ng/μl, 5.5 ng/μl, 6.5 ng/μl). In some embodiments, a sample of extracted DNA has an acceptable concentration of at least 3 ng/μl (e.g., 3 ng/μl, 5 ng/μl, 10 ng/μl). In some embodiments, the concentration of the extracted DNA is measured by a fluorometer for quantification of DNA or RNA (e.g., a Qubit fluorometer available from ThermoFisher Scientific, www.thermofisher.com).

In some embodiments, a sample of extracted DNA has a target concentration of at least 4 ng/μl (e.g., 4 ng/μl, 6 ng/μl, 8 ng/μl). In some embodiments, a sample of extracted DNA has an acceptable concentration of at least 2.5 ng/μl (e.g., 2.5 ng/μl, 4.5 ng/μl, 5.5 ng/μl). In some embodiments, the concentration of the extracted DNA is measured by Tapestation.

In some embodiments, a sample of extracted RNA has a target concentration of at least 2 ng/μl (e.g., 2 ng/μl, 4 ng/μl, 6 ng/μl). In some embodiments, a sample of extracted RNA has an acceptable concentration of at least 4 ng/μl (e.g., 4 ng/μl, 6 ng/μl, 10 ng/μl). In some embodiments, the concentration of the extracted DNA is measured by a fluorometer for quantification of DNA or RNA (e.g., a Qubit fluorometer available from ThermoFisher Scientific, www.thermofisher.com).

In some embodiments, a sample of extracted RNA has a target concentration of at least 4 ng/μl (e.g., 4 ng/μl, 6 ng/μl, 8 ng/μl). In some embodiments, a sample of extracted RNA has an acceptable concentration of at least 1.5 ng/μl (e.g., 1.5 ng/μl, 3.5 ng/μl, 5.5 ng/μl). In some embodiments, the concentration of the extracted RNA is measured by Tapestation. In some embodiments, the acceptable RNA integrity number (RIN) is at least 5 (e.g., 5, 6, 7). In some embodiments, the target RNA integrity number (RIN) is at least 8 (e.g., 8, 9, 10). In some embodiments, the RIN is determined by Tapestation.
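As an illustration only, the two-tier (acceptable versus target) thresholds in the preceding paragraph could be encoded as follows. The threshold values are the Tapestation RNA concentration and RIN values stated above; the class and function names, and the rule that the overall grade equals the worst individual grade, are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """Minimum values for the 'acceptable' and 'target' grades."""
    acceptable: float
    target: float


def grade(value: float, tier: Tier) -> str:
    """Classify one measurement as 'target', 'acceptable', or 'fail'."""
    if value >= tier.target:
        return "target"
    if value >= tier.acceptable:
        return "acceptable"
    return "fail"


# Thresholds taken from the paragraph above (Tapestation, extracted RNA).
RNA_CONCENTRATION_NG_PER_UL = Tier(acceptable=1.5, target=4.0)
RNA_RIN = Tier(acceptable=5.0, target=8.0)

_ORDER = ["fail", "acceptable", "target"]


def grade_rna_sample(conc_ng_per_ul: float, rin: float) -> dict:
    """Grade each metric; the overall grade is the worst one (an assumption)."""
    grades = {
        "concentration": grade(conc_ng_per_ul, RNA_CONCENTRATION_NG_PER_UL),
        "rin": grade(rin, RNA_RIN),
    }
    grades["overall"] = min(grades.values(), key=_ORDER.index)
    return grades


result = grade_rna_sample(conc_ng_per_ul=5.0, rin=6.0)
```

A sample graded "fail" under such a scheme would correspond to one that is not processed further or is combined with another sample.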

In some embodiments, the target purity of a sample of extracted RNA is such that it corresponds to a range of a ratio of absorbance at 260 nm to absorbance at 280 nm of at least 1.8-2 (e.g., at least 1.8-2, at least 1.8-1.9). In some embodiments, the purity of a sample of extracted RNA is such that it corresponds to a ratio of absorbance at 260 nm to absorbance at 280 nm of at least 1.8. In some embodiments, the acceptable purity of a sample of extracted RNA is such that it corresponds to a ratio of absorbance at 260 nm to absorbance at 280 nm of at least 1.5 (e.g., at least 1.5, at least 1.7, at least 2). In some embodiments, the target purity of a sample of extracted RNA is such that it corresponds to a range of a ratio of absorbance at 260 nm to absorbance at 230 nm of at least 2-2.2 (e.g., at least 2-2.2, at least 2-2.1). In some embodiments, the acceptable purity of a sample of extracted RNA is such that it corresponds to a ratio of absorbance at 260 nm to absorbance at 230 nm of at least 1.5 (e.g., at least 1.5, at least 1.7, at least 2). In some embodiments, the purity of a sample of extracted RNA as described herein is analyzed by a spectrophotometer, for example a small volume full-spectrum, UV-visible spectrophotometer (e.g., Nanodrop spectrophotometer available from ThermoFisher Scientific, www.thermofisher.com). In some embodiments, the concentration of extracted DNA is at least 10-2000 ng/μl (e.g., 10-2000 ng/μl, 10-1000 ng/μl, 10-200 ng/μl, 1-200 ng/μl, 0.5-400 ng/μl, 0.5-200 ng/μl, 100-200 ng/μl, 100-400 ng/μl, 100-500 ng/μl, 50-500 ng/μl, or 50-250 ng/μl).

Protocols for quality control of a sample of extracted RNA or DNA can be found at least in Example 6. In some embodiments, the purity of a sample of extracted DNA and/or RNA as described herein can be analyzed by any other suitable technologies or tools. In some embodiments, a sample of extracted RNA or DNA is not processed further if it does not meet a particular quantity or purity standard as described above. In some embodiments, if a sample of extracted RNA or DNA does not meet a particular quantity or purity standard, it is combined with another sample.

Library Preparation for RNA Sequencing

Methods of preparing cDNA libraries from a sample of RNA are known in the art. For example, www.illumina.com/content/dam/illumina-marheting/documents/applications/ngs-library-prep/for-all-you-seq-rna.pdf provides illustrations of different methods for preparing cDNA libraries for RNA sequencing. Non-limiting examples of cDNA library preparation include ClickSeq, 3Seq, and cP-RNA-Seq. In some embodiments, preparing a cDNA library from RNA comprises purifying mRNA from the sample of RNA (RNA enrichment). In some embodiments, enriched RNA is fragmented. In some embodiments, after selection of the appropriate RNA fraction is completed, the molecules are fragmented into smaller pieces, to a size between 50-1000 bp (e.g., 50-100 bp, 100-800 bp, 100-500 bp, or 200-500 bp) depending on the sequencing platform being used. This fragmentation can be achieved either by fragmenting double-stranded (ds) cDNA or by fragmenting RNA. Both methods result in the same end product of a double stranded cDNA library in which each fragment has an adapter attached.

In some embodiments, a library preparation method comprises one or more amplification steps to add functional elements (e.g., sample indices, molecular barcodes or flow cell oligo binding sites), enrich for sequencing-competent DNA fragments, and/or generate a sufficient amount of library DNA for downstream processing. In some embodiments, enriched RNA (e.g., fragmented enriched RNA) is amplified using random primers (e.g., random hexamers). In some embodiments, enriched RNA (e.g., fragmented enriched RNA) is amplified using oligodTs. In some embodiments, RNA is then removed from the formed cDNA. In some embodiments, cDNA is amplified to include sequencing adapters and indices (i.e., a plurality of indexes). An adapter is a DNA sequence of 10-100 bp (e.g., 10-20, 10-100, 20-80, 30-70, 40-60, 20-100, 40-100, 40-80, 30-60, or 45-65 bp) that can bind to a flow cell for sequencing. Adapters also allow for PCR enrichment of adapter-ligated DNA fragments. Adapters also can allow for indexing or barcoding of samples so that multiple cDNA libraries can be mixed together into one sequencing sample (or lane); i.e., they allow for multiplexing. In some embodiments, an index or a barcode is 4-20 bp long (e.g., 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 4-20, 5-15, 6-12, or 4-12 bp long). tucf-genomics.tufts.edu/documents/protocols/TUCF_Understanding_Illumina_TruSeq_Adapters.pdf provides example protocols for preparing cDNA libraries using adapters and indexing and is incorporated herein by reference in its entirety. Protocols for constructing a DNA or RNA library can at least be found in Example 3 and Example 5.
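To illustrate why indexing permits multiplexing, the following Python sketch assigns pooled reads back to their source libraries by exact match of the index sequence. It is a simplified illustration under stated assumptions: the function name, the index sequences, and the sample names are hypothetical, and production demultiplexing software typically also tolerates index mismatches, which this sketch does not.

```python
from collections import defaultdict


def demultiplex(reads, index_to_sample):
    """Group read sequences by sample using exact index matches.

    reads: iterable of (index_sequence, read_sequence) pairs.
    index_to_sample: dict mapping index sequence -> sample name.
    Reads with an unknown index are collected under 'undetermined'.
    """
    bins = defaultdict(list)
    for index_seq, read_seq in reads:
        sample = index_to_sample.get(index_seq, "undetermined")
        bins[sample].append(read_seq)
    return dict(bins)


# Hypothetical 6 bp indices for two pooled cDNA libraries.
index_to_sample = {"ATCACG": "library_1", "CGATGT": "library_2"}
pooled_reads = [
    ("ATCACG", "GGCTA"),
    ("CGATGT", "TTAGC"),
    ("ATCACG", "CCGTA"),
    ("NNNNNN", "AAAAA"),
]
by_sample = demultiplex(pooled_reads, index_to_sample)
```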

RNA Enrichment

Methods for RNA enrichment to enrich for mRNA (herein also described as “RNA enrichment”) during cDNA library preparation are known in the art. RNA enrichment can be targeted or non-targeted. Targeted methods of RNA enrichment include use of sequence-specific capture probes. A non-limiting example of targeted mRNA enrichment includes CaptureSeq (sapac.illumina.com/science/sequencing-method-explorer/kits-and-arrays/captureseq.html), which makes use of capture probes specific to sequences of interest. Other platforms or tools suitable for targeted mRNA enrichment can also be used.

Examples of non-targeted mRNA enrichment methods include polyA capture using oligodT (e.g., those conjugated to beads), and rRNA depletion. Petrova et al. (Scientific Reports volume 7, Article number: 41114 (2017)) provides a comparison of various rRNA depletion methods and is incorporated by reference herein in its entirety. In some embodiments, rRNA depletion can be performed using enzymatic approaches (e.g., using an exonuclease that does not process mRNA). In some embodiments, an rRNA depletion method comprises subtractive hybridization, whereby rRNA is captured using sequence specific probes (see e.g., www.sciencedirect.com/topics/immunology-and-microbiology/subtractive-hybridization).

In some embodiments, polyA capture comprises capturing of mRNA bearing a polyA tail by using polyA-specific capture probes (oligodT). In some embodiments, capture probes are immobilized for ease of purification. In some embodiments, capture probes are immobilized on beads (e.g., magnetic beads). In some embodiments, a commercial kit is used to prepare a DNA library from an RNA sample. In some embodiments, an Illumina TruSeq RNA Library Prep kit is used.

The choice of mRNA enrichment method can have a large impact on the selection of transcripts that are sequenced. For example, in some embodiments, compared to rRNA depletion methods, cDNA libraries prepared using polyA enrichment result in libraries comprising a higher fraction of protein-coding transcripts (e.g., greater than 80%, greater than 90%, greater than 95%, greater than 96%, greater than 97%, greater than 98%, greater than 99%, or greater than 99.9%) compared to non-coding transcripts (e.g., rRNA, miRNA, and lncRNA).

In some embodiments, prepared cDNA libraries are tested for quality. In some embodiments, quantification of libraries for use in sequencing is generally performed before the libraries are pooled for target enrichment or amplification to ensure equal representation of indexed libraries in multiplexed applications. In some embodiments, quantification is also used to confirm that individual libraries or library pools are diluted optimally prior to sequencing. Accurate and reproducible quantification of adapter-ligated library molecules contributes to obtaining consistent and reproducible results and to maximizing sequencing yields. Loading more than the recommended amount of DNA could lead to saturation of the flowcell or increased cluster density, while loading too little DNA can lead to decreased cluster density and reduced coverage and depth.
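One way quantification supports equal representation of indexed libraries is by allowing each library to be pooled at the same molar amount. The following Python sketch is illustrative only: the function name, library names, concentrations, and the per-library amount are hypothetical. It relies on the unit identity that 1 nmol/l equals 1 fmol/μl.

```python
def equimolar_volumes(conc_nmol_per_l: dict, fmol_per_library: float) -> dict:
    """Volume (in microliters) of each library needed to contribute the
    same molar amount to a pool.

    Because 1 nmol/l == 1 fmol/ul, volume_ul = fmol / (nmol/l).
    """
    volumes = {}
    for name, conc in conc_nmol_per_l.items():
        if conc <= 0:
            raise ValueError(f"library {name!r} has non-positive concentration")
        volumes[name] = fmol_per_library / conc
    return volumes


# Hypothetical quantification results for two indexed libraries.
library_conc = {"library_1": 2.0, "library_2": 4.0}  # nmol/l
volumes_ul = equimolar_volumes(library_conc, fmol_per_library=10.0)
```

The more concentrated library contributes a proportionally smaller volume, so both libraries are equally represented in the pool.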

Methods of quantifying DNA libraries include electrophoresis, fluorometry, spectrophotometry, digital PCR, droplet-digital PCR and qPCR. Various instruments for measuring the quantity and/or quality of DNA libraries exist, e.g., the Agilent High Sensitivity D1000 ScreenTape System.

Aspects of the present disclosure provide quality control of nucleic acids for the sequencing analysis. Aspects of the present disclosure provide quality control of DNA for the sequencing analysis. Aspects of the present disclosure provide quality control of RNA for the sequencing analysis. In some embodiments, the nucleic acids can include any suitable types of DNA or RNA. In some embodiments, the quality control of nucleic acid comprises the confirmation of biopsy condition and documents. In some embodiments, the confirmation of biopsy condition and documents can include, but is not limited to, the inventory and registration of the nucleic acid materials. In some embodiments, the confirmation of biopsy condition and documents includes nucleic acid material acceptance. By way of example, for patient samples received from a healthcare provider, it is confirmed whether the patient tissues are fresh frozen or formalin-fixed paraffin-embedded. The laboratory personnel verify the compliance of the biopsy of the registered entity. The laboratory personnel verify the proper storage of the biopsy sample during transportation. The laboratory personnel verify the physical condition of the biopsy samples. In the event that the laboratory personnel identify any errors regarding the biopsy samples, the source of the biopsy samples (e.g., a healthcare provider) may be notified. In some embodiments, if the received biopsy samples are patient tissue cell lines, the samples are prepared for extraction. In some embodiments, if the received biopsy samples are extracted DNA or RNA, the samples are stored at −80° C. for further sequencing. In some embodiments, the extracted DNA can be a reference gDNA. In some embodiments, the extracted RNA can be a reference RNA.

In some embodiments, the quality control procedures provide a target range. The target range may represent the most ideal quality of a given step (e.g., extraction). In some embodiments, the quality control procedures provide an acceptable range. The acceptable range may represent ideal or acceptable quality of a given step. In some embodiments, the quality control of nucleic acid comprises ensuring the quality in the process of constructing a DNA library. In some embodiments, the quality control of nucleic acid comprises ensuring the quality in the process of constructing an RNA library. As shown in FIG. 7 and Example 6, the preparation of the DNA or RNA libraries comprises the extraction of the DNA or RNA from the patient tissue samples. In some embodiments, a spectrophotometer, for example a small volume full-spectrum, UV-visible spectrophotometer (e.g., Nanodrop spectrophotometer available from ThermoFisher Scientific, www.thermofisher.com) can be used for determining the quality of the DNA or RNA extraction. By way of example, the extracted DNA at >100 ng/μl shows that the extracted DNA passes the quality control test. The extracted RNA at >500 ng/μl shows that the extracted RNA passes the quality control test. In another example, the ratio of absorbance at 260 nm and 280 nm (260/280) of the extracted DNA at 1.8-2.0 shows that the extracted DNA passes the quality control test. The ratio of absorbance at 260 nm and 280 nm (260/280) of the extracted RNA at 2.0 shows that the extracted RNA passes the quality control test. In another example, the ratio of absorbance at 260 nm and 230 nm (260/230) of the extracted DNA at 2.0-2.2 shows that the extracted DNA passes the quality control test. The ratio of absorbance at 260 nm and 230 nm (260/230) of the extracted RNA at 2.0-2.2 shows that the extracted RNA passes the quality control test. In some embodiments, a fluorometer for quantification of DNA or RNA (e.g., a Qubit fluorometer available from ThermoFisher Scientific, www.thermofisher.com) can be used for determining the quality of the DNA or RNA extraction. In some embodiments, an electrophoresis device, for example an automated electrophoresis device (e.g., a TapeStation System available from Agilent, www.agilent.com) can be used for determining the quality of the DNA or RNA extraction. In some embodiments, any suitable technology or tool can be used for determining the quality of the DNA or RNA extraction.

In some embodiments, the acceptable total DNA amount for further DNA library construction is at least 200-1,000 ng (e.g., 200-1,000 ng, 300-1,000 ng, or 300-1,000 ng). In some embodiments, the target total DNA amount for further sequencing is more than 500-1,000 ng (e.g., 500-1,000 ng, 600-1,000 ng, or 800-1,000 ng). In some embodiments, the acceptable total RNA amount for further RNA library construction is at least 0.5-4 nmol/l (e.g., 200-1,000 ng, 300-1,000 ng, or 300-1,000 ng). In some embodiments, the target total RNA amount for further RNA library construction is at least 0.5-4 nmol/l (e.g., 500-1,000 ng, 600-1,000 ng, or 800-1,000 ng).

In some embodiments, the acceptable DNA concentration for further DNA library construction is at least 17 ng/μl (e.g., 17 ng/μl, 25 ng/μl, 35 ng/μl). In some embodiments, the target DNA concentration for further DNA library construction is at least 42 ng/μl (e.g., 42 ng/μl, 50 ng/μl, 80 ng/μl). In some embodiments, the acceptable RNA concentration for further RNA library construction is at least 0.1 ng/μl (e.g., 0.1 ng/μl, 1 ng/μl, 3 ng/μl). In some embodiments, the target RNA concentration for further RNA library construction is at least 0.1 ng/μl (e.g., 0.1 ng/μl, 1 ng/μl, 3 ng/μl). In some embodiments, the DNA and RNA concentration is detected by a fluorometer for quantification of DNA or RNA (e.g., a Qubit fluorometer available from ThermoFisher Scientific, www.thermofisher.com).

In some embodiments, the acceptable DNA concentration for further DNA library construction is at least 15 ng/μl (e.g., 15 ng/μl, 25 ng/μl, 35 ng/μl). In some embodiments, the target DNA concentration for further DNA library construction is at least 40 ng/μl (e.g., 40 ng/μl, 50 ng/μl, 80 ng/μl). In some embodiments, the acceptable RNA concentration for further RNA library construction is at least 0.1 ng/μl (e.g., 0.1 ng/μl, 1 ng/μl, 3 ng/μl). In some embodiments, the target RNA concentration for further RNA library construction is at least 0.1 ng/μl (e.g., 0.1 ng/μl, 1 ng/μl, 3 ng/μl). In some embodiments, the acceptable RNA concentration for further RNA library construction is at least 0.5 nmol/l (e.g., 0.5 nmol/l, 1 nmol/l, 5 nmol/l). In some embodiments, the target RNA concentration for further RNA library construction is at least 0.5 nmol/l (e.g., 0.5 nmol/l, 1 nmol/l, 5 nmol/l). In some embodiments, the DNA and RNA concentrations are detected by Tapestation.

In some embodiments, the acceptable RNA concentration for further RNA library construction is at least 0.5 nmol/l (e.g., 0.5 nmol/l, 1 nmol/l, 5 nmol/l). In some embodiments, the target RNA concentration for further RNA library construction is at least 0.5 nmol/l (e.g., 0.5 nmol/l, 1 nmol/l, 5 nmol/l). In some embodiments, the DNA and RNA concentration is detected by a nucleic acid amplification device (e.g., a PCR system), for example a real-time PCR system (e.g., a LightCycler Instrument available from Roche, www.lifescience.roche.com). In some embodiments, the DNA and RNA concentration can be detected by any suitable technologies or tools.

In some embodiments, if RNA is extracted, reverse transcription can be performed. In some embodiments, an RNA library can be constructed after the reverse transcription is performed. In some embodiments, a fluorometer for quantification of DNA or RNA (e.g., a Qubit fluorometer available from ThermoFisher Scientific, www.thermofisher.com) can be used for determining the quality of the DNA or RNA libraries. In some embodiments, any suitable method can be used for determining the quality of the DNA or RNA libraries. In some embodiments, an electrophoresis device, for example an automated electrophoresis device (e.g., a TapeStation System available from Agilent, www.agilent.com) can be used for determining the quality of the DNA or RNA libraries. In some embodiments, a fluorometer for quantification of DNA or RNA (e.g., a Qubit fluorometer available from ThermoFisher Scientific, www.thermofisher.com) can be used for determining the quality of the DNA or RNA extraction. In some embodiments, a nucleic acid amplification device (e.g., a PCR system), for example a real-time PCR system (e.g., a LightCycler Instrument available from Roche, www.lifescience.roche.com) can be used for determining the quality of the RNA library. In some embodiments, one or more RNA libraries can be pooled. In some embodiments, if DNA is extracted, the extracted DNA can be used for a DNA library construction. In some embodiments, the DNA fragments in the constructed DNA library can be hybridized and/or captured. In some embodiments, a fluorometer for quantification of DNA or RNA (e.g., a Qubit fluorometer available from ThermoFisher Scientific, www.thermofisher.com) can be used for determining the quality of the DNA hybridization and capture step. In some embodiments, an electrophoresis device, for example an automated electrophoresis device (e.g., a TapeStation System available from Agilent, www.agilent.com) can be used for determining the quality of the DNA hybridization and capture step. In some embodiments, a nucleic acid amplification device (e.g., a PCR system), for example a real-time PCR system (e.g., a LightCycler Instrument available from Roche, www.lifescience.roche.com) can be used for determining the quality of the DNA hybridization and capture step. In some embodiments, any suitable method can be used for determining the quality of the DNA hybridization and capture step. In some embodiments, one or more DNA libraries can be pooled. In some embodiments, an electrophoresis device, for example an automated electrophoresis device (e.g., a TapeStation System available from Agilent, www.agilent.com) can be used for determining the quality of the DNA or RNA library pooling. In some embodiments, any suitable methods can be used for determining the quality of the DNA or RNA library pooling.

In some embodiments, the acceptable and/or target final DNA concentration range for pooling is at least 0.5-4 nmol/l (e.g., 0.5-4 nmol/l, 0.5-3 nmol/l, 2-4 nmol/l). In some embodiments, the acceptable DNA concentration for pooling is at least 0.1 ng/μl (e.g., 0.1 ng/μl, 0.8 ng/μl, 4 ng/μl) when a fluorometer for quantification of DNA or RNA (e.g., a Qubit fluorometer available from ThermoFisher Scientific, www.thermofisher.com) is used. In some embodiments, the target DNA concentration for pooling is at least 0.1 ng/μl (e.g., 0.1 ng/μl, 0.8 ng/μl, 4 ng/μl) when a fluorometer for quantification of DNA or RNA (e.g., a Qubit fluorometer available from ThermoFisher Scientific, www.thermofisher.com) is used.

In some embodiments, the acceptable DNA concentration for pooling is at least 0.1 ng/μl (e.g., 0.1 ng/μl, 0.8 ng/μl, 4 ng/μl) when an electrophoresis device, for example an automated electrophoresis device (e.g., a TapeStation System available from Agilent, www.agilent.com) is used. In some embodiments, the target DNA concentration for pooling is at least 0.1 ng/μl (e.g., 0.1 ng/μl, 0.8 ng/μl, 4 ng/μl) when an electrophoresis device, for example an automated electrophoresis device (e.g., a TapeStation System available from Agilent, www.agilent.com) is used. In some embodiments, the acceptable DNA concentration for pooling is at least 0.5 nmol/l (e.g., 0.5 nmol/l, 0.8 nmol/l, 3 nmol/l) when an electrophoresis device, for example an automated electrophoresis device (e.g., a TapeStation System available from Agilent, www.agilent.com) is used. In some embodiments, the target DNA concentration for pooling is at least 0.5 nmol/l (e.g., 0.5 nmol/l, 0.8 nmol/l, 3 nmol/l) when an electrophoresis device, for example an automated electrophoresis device (e.g., a TapeStation System available from Agilent, www.agilent.com) is used. In some embodiments, the acceptable and/or target amount of DNA is in the range of 380-440 ng (e.g., 380-440 ng, 400-440 ng, 420-440 ng) when an electrophoresis device, for example an automated electrophoresis device (e.g., a TapeStation System available from Agilent, www.agilent.com) is used. In some embodiments, the acceptable DNA concentration for pooling is at least 0.5 nmol/l (e.g., 0.5 nmol/l, 0.8 nmol/l, 3 nmol/l) when a nucleic acid amplification device (e.g., a PCR system), for example a real-time PCR system (e.g., a LightCycler Instrument available from Roche, www.lifescience.roche.com) is used. In some embodiments, the target DNA concentration for pooling is at least 0.5 nmol/l (e.g., 0.5 nmol/l, 0.8 nmol/l, 3 nmol/l) when LightCycler is used.
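Because the pooling thresholds above are stated in both ng/μl and nmol/l, it can be helpful to convert between the two. The Python sketch below uses the commonly used approximation of 660 g/mol per base pair of double-stranded DNA; this approximation, the function name, and the example fragment length are assumptions for illustration and are not taken from the present disclosure.

```python
# Approximate average molar mass of one base pair of double-stranded DNA
# (a widely used rule of thumb, assumed here for illustration).
GRAMS_PER_MOL_PER_BP = 660.0


def ng_per_ul_to_nmol_per_l(conc_ng_per_ul: float, mean_fragment_bp: float) -> float:
    """Convert a dsDNA mass concentration to a molar concentration.

    nmol/l = (ng/ul * 1e6) / (660 g/mol/bp * mean fragment length in bp)
    """
    if mean_fragment_bp <= 0:
        raise ValueError("mean fragment length must be positive")
    return conc_ng_per_ul * 1e6 / (GRAMS_PER_MOL_PER_BP * mean_fragment_bp)


# Hypothetical library: 0.8 ng/ul with a mean fragment length of 400 bp.
molarity = ng_per_ul_to_nmol_per_l(0.8, 400)
meets_pooling_minimum = molarity >= 0.5  # 0.5 nmol/l threshold from the text
```

Under these assumptions, a 0.8 ng/μl library of 400 bp fragments is roughly 3 nmol/l, which falls within the 0.5-4 nmol/l range described for pooling.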

In some embodiments, the quality control of nucleic acid comprises ensuring the quality after DNA or RNA library construction, such as during the sequencing process. In some embodiments, cluster density can be a parameter for quality control of the sample run (Example 6). Cluster density is an important factor in optimizing data quality and yield of the sequencing. Without wishing to be bound by any theory, an optimal cluster density at least shows that the DNA or RNA libraries are balanced. In some embodiments, quality score and signal/noise ratio can be parameters for quality control of the sample run.

In some embodiments, the quality control of nucleic acid comprises ensuring the quality of sequencing. In some embodiments, the quality control of sequencing comprises bioinformatics quality control. In some embodiments, the sequencing can be DNA sequencing. In some embodiments, the sequencing can be RNA sequencing. In some embodiments, the sequencing can be any type of sequencing technology known in the art for determining the DNA or RNA expression profiles of a given biological sample. By way of example, the sequencing can be whole-exome sequencing. The sequencing can be transcriptome sequencing. The sequencing can be Sanger sequencing.

In some embodiments, up to 1 ng (e.g., up to 0.1, up to 0.2, up to 0.3, up to 0.4, up to 0.5, up to 0.6, up to 0.7, up to 0.8, up to 0.9, or up to 1 ng) of library in up to 2 μl (e.g., up to 0.1 μl, up to 0.5 μl, up to 0.8 μl, up to 0.9 μl, up to 1 μl, up to 1.2 μl, up to 1.4 μl, up to 1.5 μl, up to 1.8 μl, or up to 2 μl) of solution is used for quality control testing. In some embodiments, the parameters that are tested include sizes and size distribution of DNA molecules, and purity.

In some embodiments, a standard method of preparing a library of cDNA fragments from RNA fails to preserve the information pertaining to which DNA strand was the original template during transcription and subsequent synthesis of the mRNA transcript. Since antisense transcripts are likely to have regulatory roles that are distinctly different from their protein coding complement, this loss of strand information results in an incomplete understanding of the transcriptome. Strand-specific RNA-Seq can be performed to preserve this strandedness. Methods of preserving strandedness and preparing cDNA fragment libraries for it are known in the art (see e.g., Mills et al. Strand-Specific RNA-Seq Provides Greater Resolution of Transcriptome Profiling; Curr Genomics. 2013 May; 14(3): 173-181). In some embodiments, library preparation for stranded RNA-seq makes use of known orientation strand-specific adapters. In some embodiments, strands are chemically modified to preserve knowledge of their origin.

In some embodiments, methods making use of adapters include strand-specific 3′-end RNA-Seq. In some embodiments, in strand-specific 3′-end RNA-Seq, anchored oligo(dT) primers are first used to select for mRNA, which results in production of double-stranded cDNA molecules. Adapters for paired-end sequencing are then ligated to each end of the cDNA molecule. Subsequently, the fragments are sequenced, generating paired-end reads that are aligned to a reference genome. Any aligned read that contains a stretch of adenines at the end of the transcript must be a transcript that originated from the DNA antisense strand, while any read that aligns with a stretch of thymines at the front must be a transcript from the DNA sense strand.
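The strand-assignment rule stated above can be sketched as a small classifier. This is an illustration only: the function name, the return labels, and the minimum homopolymer length `min_run` are hypothetical, and a practical implementation would work on aligned reads rather than raw strings.

```python
def infer_template_strand(read: str, min_run: int = 8) -> str:
    """Classify a read by its poly(A) tail or poly(T) head, following the rule in the text.

    min_run is an assumed cutoff for what counts as a 'stretch' of A or T.
    """
    seq = read.upper()
    if seq.endswith("A" * min_run):
        # A stretch of adenines at the end: transcript from the DNA antisense strand.
        return "antisense"
    if seq.startswith("T" * min_run):
        # A stretch of thymines at the front: transcript from the DNA sense strand.
        return "sense"
    return "undetermined"


if __name__ == "__main__":
    print(infer_template_strand("GATTACAGG" + "A" * 10))  # antisense
    print(infer_template_strand("T" * 10 + "CCTGTAATC"))  # sense
```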

In some embodiments, methods making use of adapters make use of single-stranded (ss) cDNA, Illumina adapters, and T4 DNA ligase, which allows for linking of 3′ and 5′ adapters to ssDNA. As the second strand is never synthesized and does not proceed to sequencing, strand information is retained.

In some embodiments, any suitable technologies or tools can be used for preserving strandedness. For example, Flowcell reverse transcription sequencing (FRT-Seq) can be used to preserve strandedness. In some embodiments, FRT-Seq or an equivalent technology comprises ligation of adapters to either end of fragmented and purified polyadenylated mRNA. In some embodiments, each adapter comprises two regions: a region to which the sequencing primers anneal and a region that is complementary to the oligonucleotides present on the flowcell. The complementary region allows the mRNA fragment to hybridize to the flowcell. The mRNA fragments are then reverse transcribed on the flowcell surface.

Other non-limiting adapter-based methods of preserving strandedness include direct strand-specific sequencing (DSSS) and the SOLiD® Total RNA-Seq Kit (tools.thermofisher.com/content/sfs/manuals/cros_078610.pdf), which preserves strand specificity through the addition of adapters in a directional manner.

In some embodiments, chemical modification of strands to preserve knowledge of their origin comprises marking the original RNA template through the use of bisulfite treatment. In some embodiments, dUTPs are incorporated into the reverse transcription reaction, resulting in ds cDNA where the original strand has deoxythymidine residues while the complementary strand contains deoxyuridine residues. Uracil-DNA-Glycosylase (UDG) treatment can then be used to degrade complementary strands.

Library Preparation of WES

An “exome” is the sum of all regions in the genome comprised of exons. Exons are DNA regions that are transcribed into messenger RNA, as opposed to introns, which are removed by splicing proteins. Exome sequencing is a capture-based method developed to identify variants in the coding region of genes that affect protein function. As the coding portion of the genome encompasses only 1-2% of the entire genome, this approach represents a cost-effective strategy to detect DNA alterations that may alter protein function, compared to whole genome sequencing. In some embodiments, whole exome sequencing (WES) comprises preparation of a library of DNA fragments for sequencing from a sample of DNA. In some embodiments, DNA is first fragmented to the appropriate size (depending on the sequencing platform used) and then sequencing platform-specific adapters are added. In some embodiments, libraries are amplified before the next step in the process (target enrichment or sequencing).

Kits are commercially available for the preparation of libraries, non-limiting examples of which include KAPA HyperPrep Kits, Agilent HaloPlex, Agilent SureSelect QXT, IDT xGEN Exome, Illumina Nextera Rapid Capture Exome, Roche Nimblegen SeqCap, and MYcroarray MYbaits. In some embodiments, any kit that can prepare a DNA library for WES can be used. For example, an Agilent Human All Exon V6 Capture Kit (www.agilent.com/es/library/datasheets/public/SureSelect%20V6%20DataSheet%205991-5572EN.pdf) is used to prepare a DNA library for WES. In some embodiments, a Clinical Research Exome kit (www.agilent.com/en/promotions/clinical-research-exome-v2) is used. Quantities of DNA needed depend on the specific reagents used to prepare the library. For example, 100 ng of genomic DNA is sufficient for Agilent SureSelect XT2 V6 Exome, but 500 ng of genomic DNA is required for IDT xGEN Exome Panel. A comparison of various capture kits is provided in www.genohub.com/exome-sequencing-library-preparation/.

In some embodiments, a library preparation method comprises one or more amplification steps to add functional elements (e.g., sample indices, molecular barcodes, or flow cell oligo binding sites), enrich for sequencing-competent DNA fragments, and/or generate a sufficient amount of library DNA for downstream processing. By way of example, a library preparation method is shown in Example 3 and Example 5.

In some embodiments, prepared DNA libraries are tested for quality. In some embodiments, quantification of libraries for use in sequencing is generally performed before the libraries are pooled for target enrichment or amplification to ensure equal representation of indexed libraries in multiplexed applications. In some embodiments, quantification is also used to confirm that individual libraries or library pools are diluted optimally prior to sequencing. Accurate and reproducible quantification of adapter-ligated library molecules contributes to obtaining consistent and reproducible results and to maximizing sequencing yields. Loading more than the recommended amount of DNA could lead to saturation of the flowcell or increased cluster density, while loading too little DNA can lead to decreased cluster density and reduced coverage and depth.

Methods of quantifying DNA libraries include electrophoresis, fluorometry, spectrophotometry, digital PCR, droplet-digital PCR, and qPCR. Various instruments for measuring the quantity and/or quality of DNA libraries exist, e.g., the Agilent High Sensitivity D1000 ScreenTape System.

In some embodiments, prepared DNA libraries are tested for quality. In some embodiments, up to 1 ng (e.g., up to 0.1, up to 0.2, up to 0.3, up to 0.4, up to 0.5, up to 0.6, up to 0.7, up to 0.8, up to 0.9, or up to 1 ng) of library in up to 2 μl (e.g., up to 0.1 μl, up to 0.5 μl, up to 0.8 μl, up to 0.9 μl, up to 1 μl, up to 1.2 μl, up to 1.4 μl, up to 1.5 μl, up to 1.8 μl, or up to 2 μl) of solution is used for quality control testing. In some embodiments, the parameters that are tested include sizes and size distribution of DNA molecules, and purity.

RNA Sequencing

RNA sequencing is a tool to measure the transcriptome. The transcriptome is comprised of different populations of RNA molecules, including mRNA, rRNA, tRNA, and other non-coding RNA (such as microRNA, lncRNA). In some embodiments, RNA sequencing is used to profile the transcriptome (e.g., the coding and/or non-coding regions). In some embodiments, it is used to identify genes that are differentially expressed in different biological samples (e.g., cells, tissue, or bodily fluid). In some embodiments, RNA sequencing is used to determine the genetic effects of splicing events, identify novel transcripts, detect structural variations (e.g., gene fusions and isoforms), and/or to detect single nucleotide variants.

In some embodiments, the term “RNA sequencing” can be used interchangeably with “RNA seq,” “RNA-seq,” or the variations thereof as known in the art, referring to any technologies, tools, or platforms that interrogate the transcriptome. It is noted that when “RNA sequencing,” “RNA seq,” “RNA-seq,” or the variations thereof is referred to in the present disclosure, it does not refer to a specific technology or tool that is associated with a particular platform or company, unless indicated otherwise by way of non-limiting examples for demonstrating the processes or systems as described herein. In some embodiments, RNA sequencing can be conducted by using any suitable sequencing platforms and/or sequencing methods. Non-limiting examples of high-throughput sequencing platforms include mRNA-seq, total RNA-seq, targeted RNA-seq, single-cell RNA-Seq, RNA exome capture platform, or small RNA-seq (e.g., Illumina, www.illumina.com), SMRT (single molecule, real-time) sequencing (e.g., Pacific Biosciences, https://www.pacb.com), and RNA sequencing (e.g., ThermoFisher, https://www.thermofisher.com).

As described above, RNA sequencing can be targeted or untargeted. Targeted approaches include using sequence-specific probes or oligonucleotides to sequence one or more specific regions of the transcriptome. In some embodiments, targeted RNA sequencing includes methods such as mRNA enrichment (e.g., by polyA enrichment or rRNA depletion).

In some embodiments, RNA sequencing is whole transcriptome sequencing. Whole transcriptome sequencing comprises measurement of the complete complement of transcripts in a sample. In some embodiments, whole transcriptome sequencing is used to determine global expression levels of each transcript (e.g., both coding and non-coding), and/or to identify exons, introns, and/or their junctions.

In some embodiments, RNA is sequenced directly without preparing cDNA from a sample of RNA. In some embodiments, direct RNA sequencing comprises single molecule RNA sequencing (DRS™).

In some embodiments, RNA sequencing is mRNA sequencing. In some embodiments, mRNA sequencing is the sequencing of only coding transcripts with the goal to exclude non-coding regions. In some embodiments, mRNA sequencing is independent of polyA enrichment. In some embodiments, mRNA sequencing depends on polyA enrichment.

In some embodiments, RNA is extracted from a biological sample, mRNA is enriched from the extracted RNA, and cDNA libraries are constructed from the enriched mRNA. In some embodiments, single pieces of cDNA from a cDNA library are attached to a solid matrix. In some embodiments, single pieces of cDNA from a cDNA library are attached to a solid matrix by limited dilution. In some embodiments, cDNA pieces attached to a matrix are then sequenced (e.g., using Pacific Biosciences (PacBio) technology). In some embodiments, cDNA pieces that are attached to a matrix are amplified and sequenced (e.g., using a specialized emulsion PCR (emPCR) in SOLiD, 454 Pyrosequencing, or Ion Torrent platforms, or a connector-based bridging reaction in Illumina platforms).

In some embodiments, cDNA transcripts can be sequenced in parallel, either by measuring the incorporation of fluorescent nucleotides (for example, Illumina), fluorescent short linkers (for example, SOLiD), by the release of the by-products derived from the incorporation of normal nucleotides (454), by measuring fluorescence emissions, or by measuring pH change (for example, Ion Torrent). In some embodiments, cDNA transcripts can be sequenced using any known sequencing platform. Jazayeri et al. (RNA-seq: a glance at technologies and methodologies; Acta biol. Colomb. vol. 20 no. 2 Bogotá May/August 2015) provides a comparison of different RNA-seq platforms, and is incorporated herein by reference in its entirety, including Table 3 and Table 4. Mestan et al. (Genomic sequencing in clinical trials; Journal of Translational Medicine 2011, 9:222) provides a similar analysis for sequencing in clinical trials.

In some embodiments, RNA sequencing is stranded or strand-specific. cDNA synthesis from RNA results in loss of strandedness. In some embodiments, strandedness is preserved by chemically labeling either or both the RNA strand and the cDNA strand that is formed by reverse transcription or antisense transcription, or by using adapter-based techniques to distinguish the original RNA strand from the complementary DNA strand, as described above.

In some embodiments, nonstranded RNA sequencing is performed. In some embodiments, stranded RNA-seq should be avoided for clinical samples. In some embodiments, nonstranded RNA-seq is used to compare data obtained from a biological sample to RNA sequencing data in established data sets (e.g., The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC)).

In some embodiments, RNA sequencing yields paired-end reads. Paired-end reads are reads of the same nucleic acid fragment that start from either end of the fragment. In some embodiments, RNA sequencing is performed with paired-end reads of at least 2×25 (e.g., 2×25, 2×50, 2×75, 2×100, 2×125, 2×150, 2×175, 2×200, 2×225, 2×250, 2×275, 2×300, 2×325, or 2×350). In some embodiments, RNA sequencing is performed with paired-end reads of at least 2×75. RNA sequencing with 2×75 paired-end reads means that each read of a pair covers, on average, 75 base pairs. In some embodiments, RNA sequencing is performed with a total of at least 20 million (e.g., at least 20 million, at least 30 million, at least 40 million, at least 50 million, at least 60 million, at least 70 million, at least 80 million, at least 90 million, at least 100 million, at least 120 million, at least 140 million, at least 150 million, at least 160 million, at least 180 million, at least 200 million, at least 250 million, at least 300 million, at least 350 million, or at least 400 million) paired-end reads. In some embodiments, RNA sequencing is performed with a total of at least 50 million paired-end reads. In some embodiments, RNA sequencing is performed with a total of at least 100 million paired-end reads.

In some embodiments, quality control is performed for RNA sequencing. In some embodiments, cluster density or cluster PF % is a parameter for determining the quality of the sample run. In some embodiments, the target range of cluster density or cluster PF % is at least 170-220 (e.g., 170-220, 190-220, 210-220). In some embodiments, the acceptable range of cluster density or cluster PF % is at least 280 (e.g., 280, 300, 450).

In some embodiments, %≥Q30 is a parameter for determining the quality of the sample run. In some embodiments, the target %≥Q30 is at least 85% (e.g., 85%, 90%, 95%). In some embodiments, the acceptable %≥Q30 is at least 75% (e.g., 75%, 85%, 95%).

In some embodiments, error rate % is a parameter for determining the quality of the sample run. In some embodiments, the target error rate % is less than 0.7% (e.g., 0.6%, 0.5%, 0.4%). In some embodiments, the acceptable error rate % is less than 1% (e.g., 0.9%, 0.8%, 0.7%).

Whole Exome Sequencing (WES)

Whole exome sequencing (WES) is a genomic technique for sequencing all of the protein-coding regions of genes in a genome. In some embodiments, WES is performed to identify genetic variants that alter protein sequences. In some embodiments, WES is performed to identify genetic variants that alter protein sequences at a cost that is lower than the cost of whole genome sequencing.

In some embodiments, whole exome sequencing (WES) is performed on a sample of DNA that has been extracted from a biological sample. In some embodiments, a library of DNA fragments is prepared from the sample of extracted DNA. In some embodiments, any one of the methods described herein comprises performing whole exome sequencing (WES) on a library of DNA fragments. Preparation of DNA libraries from a sample of DNA for WES is described above.

In some embodiments, libraries of DNA are quantified before sequencing (e.g., using next-generation sequencing (NGS)). In some embodiments, DNA libraries are pooled before sequencing. In some embodiments, DNA libraries are amplified before sequencing. In some embodiments, DNA libraries are indexed before sequencing to keep track of the origin of a DNA fragment.

In some embodiments, WES comprises target enrichment, allowing the selective capture of genomic regions of interest prior to sequencing. In some embodiments, array-based capture is used (e.g., using microarrays). In some embodiments, in-solution capture is used.

Any high-throughput DNA sequencing platform and/or method can be used in any one of the methods described herein. In some embodiments, DNA sequencing can be conducted by using any suitable platforms and/or methods. Non-limiting examples of high-throughput sequencing methods include Single-molecule real-time sequencing, Ion semiconductor (Ion Torrent sequencing), Pyrosequencing (i.e., 454), Sequencing by synthesis (Illumina), Illumina (Solexa) sequencing, Combinatorial probe anchor synthesis (cPAS-BGI/MGI), Sequencing by ligation (SOLiD sequencing), Nanopore Sequencing (e.g., using an instrument from Oxford Nanopore Technologies), Chain termination (Sanger sequencing), massively parallel signature sequencing (MPSS), Polony sequencing, Heliscope single molecule sequencing, and Single molecule real time (SMRT) sequencing (e.g., using an instrument from Pacific Biosciences). Other non-limiting examples of high-throughput sequencing techniques include Tunnelling currents DNA sequencing, sequencing by hybridization, Sequencing with mass spectrometry, Microfluidic Sanger sequencing, and RNAP sequencing.

In some embodiments, DNA sequencing yields paired-end reads. Paired-end reads are reads of the same nucleic acid fragment that start from either end of the fragment. In some embodiments, DNA sequencing is performed with paired-end reads of at least 2×25 (e.g., 2×25, 2×50, 2×75, 2×100, 2×125, 2×150, 2×175, 2×200, 2×225, 2×250, 2×275, 2×300, 2×325, or 2×350). In some embodiments, DNA sequencing is performed with paired-end reads of at least 2×75. DNA sequencing with 2×75 paired-end reads means that each read of a pair covers, on average, 75 base pairs. In some embodiments, DNA sequencing is performed with a total of at least 20 million (e.g., at least 20 million, at least 30 million, at least 40 million, at least 50 million, at least 60 million, at least 70 million, at least 80 million, at least 90 million, at least 100 million, at least 120 million, at least 140 million, at least 150 million, at least 160 million, at least 180 million, at least 200 million, at least 250 million, at least 300 million, at least 350 million, or at least 400 million) paired-end reads. In some embodiments, DNA sequencing is performed with a total of at least 50 million paired-end reads. In some embodiments, DNA sequencing is performed with a total of at least 100 million paired-end reads. In some embodiments, DNA sequencing is performed so that at least a 20× (e.g., at least a 20×, at least a 30×, at least a 40×, at least a 50×, at least a 60×, at least a 70×, at least an 80×, at least a 90×, at least a 100×, at least a 120×, at least a 125×, at least a 150×, at least a 175×, at least a 200×, at least a 250×, at least a 300×, or at least a 400×) coverage is yielded. Coverage, which also is referred to as depth, is the number of times, on average, a single base pair in a sample of nucleic acid is read or sequenced. In some embodiments, the portion of the genome that is targeted for capture and sequencing is at least 10 Mb (e.g., at least 10 Mb, at least 20 Mb, at least 30 Mb, at least 40 Mb, at least 50 Mb, at least 60 Mb, at least 70 Mb, at least 80 Mb, at least 90 Mb, at least 100 Mb, at least 120 Mb, at least 150 Mb, at least 200 Mb, at least 250 Mb, at least 300 Mb, or at least 350 Mb). In some embodiments, the portion of the genome that is targeted for capture and sequencing is at least 48 Mb (e.g., after using the Agilent Human All Exon V6 Capture system). In some embodiments, the portion of the genome that is targeted for capture and sequencing is at least 54 Mb (e.g., after using the Clinical Research Exome capture system (Agilent)).
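Read count, read length, and target size together bound the achievable coverage. The following is a rough sketch of that arithmetic, assuming every sequenced base maps uniquely onto the target, so the result is an upper bound that ignores duplicates, off-target reads, and uneven capture. It also assumes the read total is counted in read pairs. The function name is hypothetical.

```python
def mean_coverage(read_pairs: int, read_length_bp: int, target_size_bp: int) -> float:
    """Upper-bound mean coverage for paired-end sequencing of a captured target.

    Assumes read_pairs counts pairs (two reads each) and that all bases land on target.
    """
    if target_size_bp <= 0:
        raise ValueError("target size must be positive")
    total_bases = read_pairs * 2 * read_length_bp
    return total_bases / target_size_bp


if __name__ == "__main__":
    # 50 million pairs of 2x75 reads over a 48 Mb target.
    print(mean_coverage(50_000_000, 75, 48_000_000))  # 156.25
```

If the read total is instead counted in individual reads, the result is halved.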

In some embodiments, quality control is performed for whole-exome sequencing. In some embodiments, cluster density or cluster PF % is a parameter for determining the quality of the sample run. In some embodiments, the target range of cluster density or cluster PF % is at least 170-220 (e.g., 170-220, 190-220, 210-220). In some embodiments, the acceptable range of cluster density or cluster PF % is at least 280 (e.g., 280, 300, 450).

In some embodiments, actual yield is a parameter for determining the quality of the sample run. In some embodiments, the target actual yield is at least 15 Gbp (e.g., 15 Gbp, 20 Gbp, 30 Gbp).

In some embodiments, %≥Q30 is a parameter for determining the quality of the sample run. In some embodiments, the target %≥Q30 is at least 85% (e.g., 85%, 90%, 95%). In some embodiments, the acceptable %≥Q30 is at least 75% (e.g., 75%, 85%, 95%).

In some embodiments, error rate % is a parameter for determining the quality of the sample run. In some embodiments, the target error rate % is less than 0.7% (e.g., 0.6%, 0.5%, 0.4%). In some embodiments, the acceptable error rate % is less than 1% (e.g., 0.9%, 0.8%, 0.7%).
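The %≥Q30 and error-rate thresholds above can be combined into a simple run-grading check. The sketch below is illustrative only: the disclosure lists the thresholds per metric, and requiring both metrics to pass at the same level, as well as the labels "target", "acceptable", and "fail", are assumptions made here. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    pct_q30: float         # percentage of bases with quality score >= Q30
    error_rate_pct: float  # sequencing error rate, in percent


def grade_run(metrics: RunMetrics) -> str:
    """Grade a run against the target and acceptable thresholds given in the text."""
    # Target: %>=Q30 at least 85% and error rate below 0.7%.
    if metrics.pct_q30 >= 85.0 and metrics.error_rate_pct < 0.7:
        return "target"
    # Acceptable: %>=Q30 at least 75% and error rate below 1%.
    if metrics.pct_q30 >= 75.0 and metrics.error_rate_pct < 1.0:
        return "acceptable"
    return "fail"


if __name__ == "__main__":
    print(grade_run(RunMetrics(pct_q30=90.0, error_rate_pct=0.5)))
    print(grade_run(RunMetrics(pct_q30=80.0, error_rate_pct=0.9)))
```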

Reagents and Kits

Contemplated herein are reagents and kits comprising reagents for performing any one of the methods described herein. In some embodiments, a kit as provided herein comprises reagents (e.g., buffers, preservatives, inhibitors, or enzymes) and/or labware (e.g., pipettes, filters, tubes, storage containers such as vacutainers, or dissection tools) for storing biological samples obtained from a subject.

In some embodiments, a kit as provided herein comprises reagents (e.g., buffers, preservatives, inhibitors, or enzymes) and/or labware (e.g., pipettes, filters, or tubes) for extracting RNA and/or DNA from a biological sample or a sample derived from a biological sample (e.g., a single-cell solution). In some embodiments, a kit as provided herein comprises reagents (e.g., buffers, preservatives, inhibitors, enzymes, or dyes) and/or labware (e.g., pipettes, filters, tubes, storage containers, or electrophoresis paper) for measuring the quality and quantity of RNA and/or DNA extracted from a biological sample. In some embodiments, a kit as provided herein comprises reagents (e.g., buffers, preservatives, inhibitors, enzymes, or dyes) and/or labware (e.g., pipettes, filters, tubes, storage containers, or electrophoresis paper) for measuring the quality and quantity of DNA libraries for sequencing (e.g., RNA-seq or WES).

In some embodiments, a kit as provided herein comprises reagents (e.g., buffers, preservatives, inhibitors, or enzymes) and/or labware (e.g., pipettes, filters, tubes, storage containers such as vacutainers, or dissection tools) for preparing a single-cell solution from a biological sample.

In some embodiments, a kit as provided herein comprises reagents (e.g., buffers, inhibitors, or enzymes such as reverse transcriptase enzyme) and/or labware (e.g., pipettes, filters, tubes, storage containers) for preparing DNA libraries for sequencing.

In some embodiments, a kit as provided herein comprises reagents (e.g., buffers, preservatives, inhibitors, or enzymes) and/or labware (e.g., pipettes, filters, tubes, storage containers such as vacutainers, or dissection tools) for any combination of two or more of the following: storing biological samples, extracting RNA and/or DNA from a biological sample, testing the quality and quantity of extracted RNA and/or DNA samples and/or DNA libraries prepared therefrom, preparing single-cell solutions from a biological sample, and preparing DNA libraries from extracted RNA and/or DNA.

In some embodiments, any one of the kits described herein comprises components for making a cell-dissociation cocktail. A cell-dissociation cocktail may be enzymatic or non-enzymatic. In some embodiments, a kit comprises one or more enzyme cocktails. In some embodiments, a kit comprises any one or more of the following components: media (e.g., L-15 media), antibacterials (e.g., penicillin and/or streptomycin), anti-fungals (e.g., amphotericin), collagenase (e.g., collagenase I, collagenase II, collagenase IV), DNAse (e.g., DNAse I), elastase, hyaluronidase, and proteases (e.g., protease XIV, trypsin, papain, or thermolysin). In some embodiments, any one of the kits described herein comprises one or more of the following enzymes: collagenase I and collagenase IV. In some embodiments, these enzymes are comprised in separate containers. In some embodiments, these enzymes are comprised in a single container.

In some embodiments, a kit comprises a smaller piece of equipment such as a spectrophotometer. In some embodiments, a kit comprises instructions for performing any one of, or a combination of any two or more of, the following: storing biological samples, extracting RNA and/or DNA from a biological sample, testing the quality and quantity of extracted RNA and/or DNA samples and/or DNA libraries prepared therefrom, preparing single-cell solutions from a biological sample, and preparing DNA libraries from extracted RNA and/or DNA. In some embodiments, a kit comprises instructions for performing any one of the methods described herein. In some embodiments, a kit is fashioned or tailored for specific tissue types, e.g., biopsies of a solid tumor, a liquid biopsy, a blood sample, or urine.

Data Processing

Aspects of this disclosure relate to processing data obtained from RNA sequencing. In some embodiments, a method to process RNA expression data (e.g., data obtained from RNA sequencing (also referred to herein as RNA-seq data)) comprises aligning and annotating genes in RNA expression data with known sequences of the human genome to obtain annotated RNA expression data; removing non-coding transcripts from the annotated RNA expression data; converting the annotated RNA expression data to gene expression data in transcripts per kilobase million (TPM) format; identifying at least one gene that introduces bias in the gene expression data; and removing the at least one gene from the gene expression data to obtain bias-corrected gene expression data. In some embodiments, a method to process RNA expression data comprises obtaining RNA expression data for a subject having or suspected of having cancer.
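The TPM conversion and gene-removal steps named above can be sketched as follows. This is a minimal illustration using the standard TPM definition (length-normalized counts scaled to sum to one million), not the implementation of this disclosure. The function names and gene labels are hypothetical, how bias-introducing genes are identified is not shown, and rescaling the remaining values back to one million after removal is an assumption exposed as an optional flag.

```python
from typing import Dict, Iterable


def to_tpm(counts: Dict[str, float], lengths_bp: Dict[str, float]) -> Dict[str, float]:
    """Convert read counts to TPM: counts per kilobase, scaled to sum to 1e6."""
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000.0) for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


def drop_genes(tpm: Dict[str, float], genes_to_remove: Iterable[str],
               renormalize: bool = True) -> Dict[str, float]:
    """Remove the given genes; optionally rescale the rest to sum to 1e6 (assumption)."""
    removed = set(genes_to_remove)
    kept = {gene: value for gene, value in tpm.items() if gene not in removed}
    total = sum(kept.values())
    if not renormalize or total == 0:
        return kept
    return {gene: value / total * 1e6 for gene, value in kept.items()}


if __name__ == "__main__":
    counts = {"GENE_A": 100, "GENE_B": 300, "GENE_C": 50}       # hypothetical genes
    lengths = {"GENE_A": 1000, "GENE_B": 3000, "GENE_C": 500}   # lengths in bp
    tpm = to_tpm(counts, lengths)
    print(drop_genes(tpm, ["GENE_C"]))
```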

In some embodiments, non-coding transcripts may comprise genes that belong to groups selected from the list consisting of: pseudogenes, polymorphic pseudogenes, processed pseudogenes, transcribed processed pseudogenes, unitary pseudogenes, unprocessed pseudogenes, transcribed unitary pseudogenes, constant chain immunoglobulin (IG C) pseudogenes, joining chain immunoglobulin (IG J) pseudogenes, variable chain immunoglobulin (IG V) pseudogenes, transcribed unprocessed pseudogenes, translated unprocessed pseudogenes, joining chain T cell receptor (TR J) pseudogenes, variable chain T cell receptor (TR V) pseudogenes, small nuclear RNAs (snRNA), small nucleolar RNAs (snoRNA), microRNAs (miRNA), ribozymes, ribosomal RNA (rRNA), mitochondrial tRNAs (Mt tRNA), mitochondrial rRNAs (Mt rRNA), small Cajal body-specific RNAs (scaRNA), retained introns, sense intronic RNA, sense overlapping RNA, nonsense-mediated decay RNA, non-stop decay RNA, antisense RNA, long intervening noncoding RNAs (lincRNA), macro long non-coding RNA (macro lncRNA), processed transcripts, 3prime overlapping non-coding RNA (3prime overlapping ncrna), small RNAs (sRNA), miscellaneous RNA (misc RNA), vault RNA (vaultRNA), and TEC RNA.
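A filter over the groups listed above can be sketched by reading the `transcript_type` attribute that GENCODE GTF annotation files attach to each transcript. The biotype set below is only an illustrative subset of the list, written with GENCODE-style labels. The function names are hypothetical, and keeping transcripts whose type is unknown is an assumption made here.

```python
import re
from typing import Dict, Optional

# Illustrative subset of non-coding biotypes (GENCODE-style labels); not exhaustive.
NON_CODING_TYPES = {
    "processed_pseudogene", "unprocessed_pseudogene", "snRNA", "snoRNA",
    "miRNA", "rRNA", "Mt_tRNA", "Mt_rRNA", "scaRNA", "retained_intron",
    "antisense", "lincRNA", "misc_RNA", "TEC",
}

_TYPE_RE = re.compile(r'transcript_type "([^"]+)"')


def transcript_type(gtf_attributes: str) -> Optional[str]:
    """Extract the transcript_type value from a GTF attribute column, if present."""
    match = _TYPE_RE.search(gtf_attributes)
    return match.group(1) if match else None


def drop_non_coding(expression: Dict[str, float],
                    types: Dict[str, str]) -> Dict[str, float]:
    """Keep transcripts whose biotype is not in NON_CODING_TYPES (unknown types are kept)."""
    return {tid: value for tid, value in expression.items()
            if types.get(tid) not in NON_CODING_TYPES}


if __name__ == "__main__":
    attrs = 'gene_id "G1"; transcript_id "T1"; transcript_type "lincRNA";'
    types = {"T1": transcript_type(attrs), "T2": "protein_coding"}
    print(drop_non_coding({"T1": 5.0, "T2": 7.0}, types))
```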

In some embodiments, information (e.g., sequence information) for one or more transcripts for one or more of these types of transcripts can be obtained in a nucleic acid database (e.g., a Gencode database, for example Gencode V23, Genbank database, EMBL database, or other database).

In some embodiments, a method to process RNA expression data (e.g., data obtained from RNA sequencing (also referred to herein as RNA-seq data)) comprises identifying a cancer treatment (also referred to herein as an anti-cancer therapy) for the subject using the bias-corrected gene expression data. In some embodiments, any one of the methods of processing RNA expression data is further combined with administering to a subject one or more anti-cancer therapies or cancer treatments. In some embodiments, any one of the methods of processing RNA expression data is further combined with directing or recommending the administering to a subject of one or more anti-cancer therapies or cancer treatments.

Obtaining RNA Expression Data

In some embodiments, a method to process RNA expression data (e.g., data obtained from RNA sequencing (also referred to herein as RNA-seq data)) comprises obtaining RNA expression data for a subject (e.g., a subject who has or has been diagnosed with a cancer). In some embodiments, obtaining RNA expression data comprises obtaining a biological sample and processing it to perform RNA sequencing using any one of the RNA sequencing methods described herein. In some embodiments, RNA expression data is obtained from a lab or center that has performed experiments to obtain RNA expression data (e.g., a lab or center that has performed RNA-seq). In some embodiments, a lab or center is a medical lab or center.

In some embodiments, RNA expression data is obtained by obtaining a computer storage medium (e.g., a data storage drive) on which the data exists. In some embodiments, RNA expression data is obtained via a secured server (e.g., an SFTP server, or Illumina BaseSpace). In some embodiments, data is obtained in the form of a text-based file (e.g., a FASTQ file). In some embodiments, a file in which sequencing data is stored also contains quality scores of the sequencing data. In some embodiments, a file in which sequencing data is stored also contains sequence identifier information.
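A FASTQ file stores each read as four lines: an identifier line starting with "@", the sequence, a "+" separator line, and a quality string of the same length as the sequence. The sketch below reads such records and computes the fraction of bases at or above Q30. It assumes Phred+33 quality encoding, which the disclosure does not specify, and the function names are hypothetical.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple

PHRED_OFFSET = 33  # Assumption: Phred+33 quality encoding.


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) tuples from four-line FASTQ records."""
    lines = (line.rstrip("\r\n") for line in handle)
    for header in lines:
        if not header:
            continue  # skip blank lines between or after records
        sequence = next(lines, "")
        next(lines, "")  # '+' separator line
        quality = next(lines, "")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality


def fraction_at_least_q30(records: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of all bases whose Phred quality score is 30 or higher."""
    total = 0
    passing = 0
    for _, _, quality in records:
        total += len(quality)
        passing += sum(1 for ch in quality if ord(ch) - PHRED_OFFSET >= 30)
    return passing / total if total else 0.0


if __name__ == "__main__":
    example = io.StringIO("@read1\nACGT\n+\nII5!\n")
    print(fraction_at_least_q30(read_fastq(example)))  # 0.5
```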

Alignment and Annotation

In some embodiments, a method to process RNA expression data (e.g., data obtained from RNA sequencing (also referred to herein as RNA-seq data)) comprises aligning and annotating genes in the RNA expression data with known sequences of the human genome to obtain annotated RNA expression data.

In some embodiments, alignment of RNA expression data comprises aligning the data to a known assembled genome for a particular species of subject (e.g., the genome of a human) or to a transcriptome database. Various sequence alignment software packages are available and can be used to align data to an assembled genome or a transcriptome database. Non-limiting examples of alignment software include short (unspliced) aligners (e.g., BLAT, BFAST, Bowtie, Burrows-Wheeler Aligner, Short Oligonucleotide Analysis package, or Mosaik), spliced aligners, aligners based on known splice junctions (e.g., ERANGE, IsoformEx, or SpliceSeq), or de novo splice aligners (e.g., ABMapper, BBMap, CRAC, or HiSAT). In some embodiments, any suitable tool can be used for aligning and annotating data. For example, Kallisto (github.com/pachterlab/kallisto) is used to align and annotate data. In some embodiments, a known genome is referred to as a reference genome. A reference genome (also known as a reference assembly) is a digital nucleic acid sequence database, assembled as a representative example of a species' set of genes. In some embodiments, human and mouse reference genomes used in any one of the methods described herein are maintained and improved by the Genome Reference Consortium (GRC). Non-limiting examples of human reference releases are GRCh38, GRCh37, NCBI Build 36.1, NCBI Build 35, and NCBI Build 34. A non-limiting example of a transcriptome database is the Transcriptome Shotgun Assembly (TSA).

In some embodiments, annotating RNA expression data comprises identifying the locations of genes and/or coding regions in the data to be processed by comparing it to assembled genomes or transcriptome databases. Non-limiting examples of data sources for annotation include GENCODE (www.gencodegenes.org), RefSeq (see e.g., www.ncbi.nlm.nih.gov/refseq/), and Ensembl. In some embodiments, annotating genes in RNA expression data is based on a GENCODE database (e.g., GENCODE V23 annotation; www.gencodegenes.org).

Conesa et al. (A survey of best practices for RNA-seq data analysis; Genome Biology 2016, 17:13) provide best practices for analyzing RNA-seq data, which are applicable to any one of the methods described herein, and which is incorporated herein by reference in its entirety. Pereira and Rueda (bioinformatics-core-shared-training.github.io/cruk-bioinf-sschool/Day2/rnaSeq_align.pdf) also describe methods for analyzing RNA sequencing data, which are applicable to any one of the methods described herein, and which is incorporated herein by reference in its entirety.

Removing Non-Coding Transcripts

In some embodiments, a method to process RNA expression data (e.g., data obtained from RNA sequencing (also referred to herein as RNA-seq data)) comprises removing non-coding transcripts from annotated RNA expression data. Aligning and annotating RNA expression data allows identification of coding and non-coding reads. In some embodiments, non-coding reads for transcripts are removed so as to concentrate analysis effort on expression of proteins (e.g., those that may be involved in pathology of cancer). In some embodiments, removing reads for non-coding transcripts from the data reduces the variance in the data, e.g., in replicates of the same or similar sample (e.g., nucleic acid from the same cells or cell-type). In some embodiments, non-limiting examples of expression data that is removed include one or more non-coding transcripts (e.g., 10-50, 50-100, 100-1,000, 1,000-2,500, 2,500-5,000 or more non-coding transcripts) that belong to one or more gene groups selected from the list consisting of: pseudogenes, polymorphic pseudogenes, processed pseudogenes, transcribed processed pseudogenes, unitary pseudogenes, unprocessed pseudogenes, transcribed unitary pseudogenes, constant chain immunoglobulin (IG C) pseudogenes, joining chain immunoglobulin (IG J) pseudogenes, variable chain immunoglobulin (IG V) pseudogenes, transcribed unprocessed pseudogenes, translated unprocessed pseudogenes, joining chain T cell receptor (TR J) pseudogenes, variable chain T cell receptor (TR V) pseudogenes, small nuclear RNAs (snRNA), small nucleolar RNAs (snoRNA), microRNAs (miRNA), ribozymes, ribosomal RNA (rRNA), mitochondrial tRNAs (Mt tRNA), mitochondrial rRNAs (Mt rRNA), small Cajal body-specific RNAs (scaRNA), retained introns, sense intronic RNA, sense overlapping RNA, nonsense-mediated decay RNA, non-stop decay RNA, antisense RNA, long intervening noncoding RNAs (lincRNA), macro long non-coding RNA (macro lncRNA), processed transcripts, 3prime overlapping non-coding RNA (3prime overlapping ncrna), small RNAs (sRNA), miscellaneous RNA (misc RNA), vault RNA (vaultRNA), and TEC RNA.

In some embodiments, information (e.g., sequence information) for one or more transcripts for one or more of these types of transcripts can be obtained in a nucleic acid database (e.g., a Gencode database, for example Gencode V23, Genbank database, EMBL database, or other database). In some embodiments, a fraction (e.g., 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 98%, 99%, or 99.5% or more) of the non-coding transcripts, histone-encoding genes, mitochondrial genes, interleukin-encoding genes, collagen-encoding genes, and/or T cell receptor-encoding genes as described herein are removed from aligned and annotated RNA expression data.
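As a non-limiting illustration, removal of non-coding transcripts from annotated data can be sketched as a filter on an annotation field. The sketch below assumes each annotated transcript is a dictionary with a "transcript_type" key holding a biotype label in the style of GENCODE annotation; the record layout, the function name remove_non_coding, and the abbreviated biotype set are assumptions made for illustration only.

```python
from typing import Dict, FrozenSet, List

# Abbreviated, illustrative set of non-coding biotype labels (assumed
# GENCODE-style spellings); a real list would cover every group above.
NON_CODING_TYPES: FrozenSet[str] = frozenset({
    "processed_pseudogene", "unprocessed_pseudogene", "snRNA", "snoRNA",
    "miRNA", "rRNA", "Mt_tRNA", "Mt_rRNA", "lincRNA", "antisense",
    "retained_intron", "misc_RNA", "TEC",
})


def remove_non_coding(
    transcripts: List[Dict[str, str]],
    non_coding_types: FrozenSet[str] = NON_CODING_TYPES,
) -> List[Dict[str, str]]:
    """Return only transcripts whose biotype is not in the non-coding set."""
    return [t for t in transcripts
            if t["transcript_type"] not in non_coding_types]


# Illustrative annotated transcripts (made-up identifiers).
transcripts = [
    {"transcript_id": "TX1", "transcript_type": "protein_coding"},
    {"transcript_id": "TX2", "transcript_type": "lincRNA"},
    {"transcript_id": "TX3", "transcript_type": "processed_pseudogene"},
]
coding_only = remove_non_coding(transcripts)
```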

Conversion to TPM and Gene Aggregation

In some embodiments, a method to process RNA expression data (e.g., data obtained from RNA sequencing (also referred to herein as RNA-seq data)) comprises normalizing RNA expression data per length of transcript (e.g., to transcripts per kilobase million (TPM) format) that is read. In some embodiments, RNA expression data that is normalized per length of transcript is first aligned and annotated. Conversion of data to TPM allows presentation of expression in the form of concentration, rather than counts, which in turn allows comparison of samples with different total read counts and/or length of reads.

In some embodiments, RNA expression data that is normalized per length of transcript read is then analyzed to obtain gene expression data (expression data per gene). This is also referred to as gene aggregation. Gene aggregation comprises combining expression data in reads for transcripts for all isoforms of a gene to obtain expression data for that gene. In some embodiments, gene aggregation to obtain gene expression data is performed after TPM normalization but before identifying genes that introduce bias. In some embodiments, gene aggregation is performed before conversion of the data to TPM.
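As a non-limiting illustration of gene aggregation, the sketch below sums transcript-level expression values over all isoforms that map to the same gene. The function name aggregate_by_gene, the mapping structure, and the identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def aggregate_by_gene(
    transcript_expression: Dict[str, float],
    transcript_to_gene: Dict[str, str],
) -> Dict[str, float]:
    """Sum expression over all isoforms (transcripts) of each gene."""
    gene_expression: Dict[str, float] = defaultdict(float)
    for transcript_id, value in transcript_expression.items():
        gene_expression[transcript_to_gene[transcript_id]] += value
    return dict(gene_expression)


# Illustrative values: two isoforms of GENE_A and one isoform of GENE_B.
transcript_tpm = {"T1": 10.0, "T2": 5.0, "T3": 2.5}
mapping = {"T1": "GENE_A", "T2": "GENE_A", "T3": "GENE_B"}
gene_tpm = aggregate_by_gene(transcript_tpm, mapping)
```

The same function applies whether aggregation is performed on TPM values or on counts prior to conversion, consistent with either ordering described above.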

Wagner et al. (Theory Biosci. (2012) 131:281-285) provides an explanation of how TPM can be calculated and is incorporated herein by reference in its entirety. In some embodiments, the following formula is used to calculate TPM:

$A \cdot \frac{1}{\sum(A)} \cdot 10^{6}$, where $A = \frac{\text{total reads mapped to gene} \cdot 10^{3}}{\text{gene length in bp}}$
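As a non-limiting worked example, the formula above can be sketched in code as follows. The function name compute_tpm and the input dictionaries are hypothetical; the arithmetic follows the formula term by term (A is the read count scaled by 10^3 and divided by gene length in bp, and each A is then divided by the sum of all A and scaled by 10^6).

```python
from typing import Dict


def compute_tpm(
    read_counts: Dict[str, float],
    gene_lengths_bp: Dict[str, float],
) -> Dict[str, float]:
    """Compute TPM as A / sum(A) * 1e6, with A = reads * 1e3 / length_bp."""
    rates = {
        gene: read_counts[gene] * 1e3 / gene_lengths_bp[gene]
        for gene in read_counts
    }
    total = sum(rates.values())
    if total == 0:
        raise ValueError("no mapped reads; TPM is undefined")
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# Illustrative values: G2 has three times the reads of G1 but is three
# times as long, so both genes receive the same TPM.
counts = {"G1": 100, "G2": 300}
lengths = {"G1": 1000, "G2": 3000}
tpm_values = compute_tpm(counts, lengths)
```

In this example both genes have A = 100, so each receives 500,000 TPM, and the values sum to 10^6 as expected for TPM-normalized data.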

Removing Bias

Since conversion of RNA expression data to obtain expression in TPM format requires dividing the number of reads for a given transcript by the length of a transcript read, biases may be introduced in the data for various reasons (as described below). Accordingly, some embodiments of any one of the methods described herein comprise identifying at least one gene that introduces bias in the gene expression data. Some embodiments of any one of the methods described herein comprise identifying at least one gene that introduces bias in the gene expression data, and removing expression data for the at least one gene from the gene expression data to obtain bias-corrected gene expression data.

In some embodiments, removing data from a dataset may involve deleting the data from the dataset, marking the data so that it is not used in some or all subsequent processing of the dataset, and/or doing any other suitable processing so that the data is not used in some or all subsequent processing of the dataset. For example, removing particular expression data (e.g., expression data for at least one gene introducing bias) from gene expression data may involve deleting the particular expression data from the gene expression data, marking the particular expression data, and/or doing any other suitable processing so that the particular expression data is not used in some or all subsequent processing of the gene expression data. As another example, removing non-coding transcripts from the RNA expression data (as described above) may involve deleting the non-coding transcripts, marking the non-coding transcripts, and/or doing any other suitable subsequent processing so that the non-coding transcripts are not used in some or all subsequent processing of the RNA expression data. As yet another example, removing sequence data, determined to not pass one or more quality control checks during performance of quality control techniques described herein, may involve deleting the sequence data, marking the sequence data, and/or doing any other suitable processing so that the sequence data failing the quality control check(s) is not used in some or all subsequent processing.
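As a non-limiting illustration of the two forms of removal described above, the sketch below contrasts deleting records with marking them so that later processing skips them. The GeneRecord class, its excluded flag, and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class GeneRecord:
    gene: str
    tpm: float
    excluded: bool = False  # hypothetical marker: True means "do not use"


def remove_by_deleting(records: List[GeneRecord], genes: Set[str]) -> List[GeneRecord]:
    """Return a new list without the named genes (deletion)."""
    return [r for r in records if r.gene not in genes]


def remove_by_marking(records: List[GeneRecord], genes: Set[str]) -> None:
    """Flag the named genes in place; the data stays in the dataset."""
    for record in records:
        if record.gene in genes:
            record.excluded = True


def usable(records: List[GeneRecord]) -> List[GeneRecord]:
    """Records that subsequent processing is allowed to use."""
    return [r for r in records if not r.excluded]


# Illustrative values only.
records = [GeneRecord("HIST1H1A", 120.0), GeneRecord("TP53", 35.0)]
deleted = remove_by_deleting(records, {"HIST1H1A"})
remove_by_marking(records, {"HIST1H1A"})
```

Marking keeps the original values available (e.g., for audit or for processing steps that still need them), whereas deletion reduces the dataset itself.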

In some embodiments, biases in expression data converted to TPM format are attributed to transcripts of an average length that is at least a threshold amount higher or lower than an average length of transcript as read in the entire expression data set. For example, if a gene has one or more transcripts of one or more isoforms with a length that is at least a threshold amount (e.g., at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, or 15 standard deviations or more) lower than the mean or median transcript length in the entire expression data set, the expression of the gene in TPM format will artificially appear to be high. Conversely, if a gene has one or more reads of one or more isoforms with a length that is at least a threshold amount (e.g., at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, or 15 standard deviations or more) higher than the mean or median read length in the entire expression data set, the expression of the gene in TPM format will artificially appear to be low. In some embodiments, a threshold value is set in terms of standard deviations (e.g., at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, or 15 standard deviations or more). In some embodiments, a threshold value is set based on a length of transcript and/or length of read, e.g., below 5 bp, below 10 bp, below 15 bp, below 20 bp, below 25 bp, below 50 bp, below 75 bp, below 100 bp, or below 150 bp or more.
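As a non-limiting illustration of a threshold set in terms of standard deviations, the sketch below flags genes whose transcript length lies at least a chosen number of standard deviations below or above the mean length in the data set. The function name flag_length_outliers, the use of the population standard deviation, and the choice of the mean rather than the median are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import Dict


def flag_length_outliers(
    lengths_bp: Dict[str, float],
    n_sd: float = 2.0,
) -> Dict[str, str]:
    """Flag genes whose length is >= n_sd standard deviations from the mean.

    "short" genes may appear artificially high in TPM; "long" genes may
    appear artificially low.
    """
    values = list(lengths_bp.values())
    center = mean(values)
    spread = pstdev(values)  # population standard deviation (an assumption)
    if spread == 0:
        return {}
    flagged: Dict[str, str] = {}
    for gene, length in lengths_bp.items():
        z_score = (length - center) / spread
        if z_score <= -n_sd:
            flagged[gene] = "short"
        elif z_score >= n_sd:
            flagged[gene] = "long"
    return flagged


# Illustrative lengths: nine genes of 2,000 bp and one gene of 200 bp.
lengths = {f"GENE_{i}": 2000 for i in range(9)}
lengths["SHORT_GENE"] = 200
outliers = flag_length_outliers(lengths, n_sd=2.0)
```

With these made-up lengths the mean is 1,820 bp and the standard deviation is 540 bp, so the 200 bp gene lies 3 standard deviations below the mean and is flagged at a threshold of 2 but not at a threshold of 4.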

In some embodiments, biases are attributed to the length of the polyA tail on a transcript. In some embodiments, RNA transcripts having a polyA tail that is on average smaller or higher than the average length of polyA tail for RNA transcripts in a sample are enriched more or less than the average enrichment of all RNA transcripts in a sample. Accordingly, a gene may be associated with a polyA tail that is at least a threshold amount smaller in length compared to an average length of polyA tails of genes from a sample from which the RNA expression data was obtained. In some embodiments, expression data for such genes is also removed from the gene expression data to obtain bias-corrected gene expression data. Removing expression data associated with one or more genes from a data set to reduce bias may be considered as a type of filtering of the data. In some embodiments, “filtration” may refer to any one or more of removing expression data for genes that appear artificially high or low (e.g., because of the lengths of transcripts, or the length of the polyA tails associated with transcripts), and removing expression data of non-coding RNA from data.

In some embodiments, identifying at least one gene that introduces bias in the gene expression data comprises analyzing the length of transcripts within the data set that is being analyzed. In some embodiments, removing, from the gene expression data, expression data for at least one gene that introduces bias decreases variability and improves the overall accuracy of subsequent gene expression-based analysis.

In some embodiments, identifying at least one gene that introduces bias in the gene expression data comprises use of knowledge gained from analyzing data outside of the expression data set in question, e.g., using reference data sets. The inventors recognized that removing (expression data for) genes having polyA tail length that is outside the average range of polyA tails in an RNA expression data set effectively removes bias and/or outliers in the gene expression data. For example, knowledge that a certain family of genes introduces biases can be obtained a priori (from previously performed experiments or previously performed processing of data), before processing RNA expression data, and can be used to filter out data for that family of genes.

In some embodiments, a gene that introduces bias to an expression data set may belong to a family of genes having a polyA tail that is on average smaller or higher compared to an average length of polyA tails of genes from a sample from which the RNA expression data was obtained (or another reference sample). In some embodiments, “smaller or higher” may refer to a numerical value that is smaller or higher relative to a known, average threshold value of one or more genes.

In some embodiments, a gene that introduces bias to an expression data set belongs to a family of genes selected from the group consisting of: histone-encoding genes, mitochondrial genes, interleukin-encoding genes, collagen-encoding genes, B cell receptor-encoding genes, and T cell receptor-encoding genes. In some embodiments, a gene that introduces bias to an expression data set can be any other gene that has a polyA tail that is on average smaller or higher compared to an average length of polyA tails of genes from a sample from which the RNA expression data was obtained (or another reference sample).

In some embodiments, the histone-encoding genes, mitochondrial genes, interleukin-encoding genes, collagen-encoding genes, B cell receptor-encoding genes, and/or T cell receptor-encoding genes are genes in the human sample that comprise a polyA tail that is on average smaller or higher compared to an average length of polyA tails of genes from a sample from which the RNA expression data was obtained. For example, histone-encoding genes comprise a polyA tail that is on average smaller than an average length of polyA tails of genes from a sample from which the RNA expression data was obtained. In some embodiments, histone-encoding genes do not comprise a polyA tail. In some embodiments, a polyA tail is minimally or not detected in histone-encoding genes.

In some embodiments, one or more gene or protein abbreviations or acronyms are used in this application to refer to the genes (or genes encoding the proteins) using their recognized scientific nomenclature. Additional information about the genes and/or encoded proteins can be found in one or more genetic sequence databases, for example the NIH genetic sequence database (GenBank, www.ncbi.nlm.nih.gov), the EMBL database (the European Molecular Biology Laboratory nucleotide sequence database, www.ebi.ac.uk/embl/index.html), the EMBL European Bioinformatics Institute database (EMBL-EBI European Nucleotide Archive, www.ebi.ac.uk/ena), the GENCODE database (www.gencodegenes.org), or other suitable database, the contents of which are incorporated by reference herein for the different types of genes and names of genes referred to herein. In some embodiments, the gene or protein abbreviations or acronyms refer to the human genes (or human genes encoding the proteins).

In some embodiments, a histone-encoding gene is HIST1H1A, HIST1H1B, HIST1H1C, HIST1H1D, HIST1H1E, HIST1H1T, HIST1H2AA, HIST1H2AB, HIST1H2AC, HIST1H2AD, HIST1H2AE, HIST1H2AG, HIST1H2AH, HIST1H2AI, HIST1H2AJ, HIST1H2AK, HIST1H2AL, HIST1H2AM, HIST1H2BA, HIST1H2BB, HIST1H2BC, HIST1H2BD, HIST1H2BE, HIST1H2BF, HIST1H2BG, HIST1H2BH, HIST1H2BI, HIST1H2BJ, HIST1H2BK, HIST1H2BL, HIST1H2BM, HIST1H2BN, HIST1H2BO, HIST1H3A, HIST1H3B, HIST1H3C, HIST1H3D, HIST1H3E, HIST1H3F, HIST1H3G, HIST1H3H, HIST1H3I, HIST1H3J, HIST1H4A, HIST1H4B, HIST1H4C, HIST1H4D, HIST1H4E, HIST1H4F, HIST1H4G, HIST1H4H, HIST1H4I, HIST1H4J, HIST1H4K, HIST1H4L, HIST2H2AA3, HIST2H2AA4, HIST2H2AB, HIST2H2AC, HIST2H2BE, HIST2H2BF, HIST2H3A, HIST2H3C, HIST2H3D, HIST2H3PS2, HIST2H4A, HIST2H4B, HIST3H2A, HIST3H2BB, HIST3H3, or HIST4H4. In some embodiments, a mitochondrial gene is MT-ATP6, MT-ATP8, MT-CO1, MT-CO2, MT-CO3, MT-CYB, MT-ND1, MT-ND2, MT-ND3, MT-ND4, MT-ND4L, MT-ND5, MT-ND6, MT-RNR1, MT-RNR2, MT-TA, MT-TC, MT-TD, MT-TE, MT-TF, MT-TG, MT-TH, MT-TI, MT-TK, MT-TL1, MT-TL2, MT-TM, MT-TN, MT-TP, MT-TQ, MT-TR, MT-TS1, MT-TS2, MT-TT, MT-TV, MT-TW, MT-TY, MTRNR2L1, MTRNR2L10, MTRNR2L11, MTRNR2L12, MTRNR2L13, MTRNR2L3, MTRNR2L4, MTRNR2L5, MTRNR2L6, MTRNR2L7, or MTRNR2L8.

In some embodiments, removing expression data for at least one gene that introduces bias in the gene expression data comprises removing expression data for one or multiple (e.g., at least 2, at least 5, at least 10, at least 15, at least 20, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 150, at least 200, at least 250, at least 300, at least 350, at least 400, at least 450, at least 500, between 2 and 1000, or any suitable number of genes in these ranges) genes in each of one or multiple (2, 3, 4, 5, or all) gene families including histone-encoding genes, mitochondrial genes, interleukin-encoding genes, collagen-encoding genes, B cell receptor-encoding genes, and T cell receptor-encoding genes. In some embodiments, removing expression data for at least one gene that introduces bias in the gene expression data comprises removing expression data for any of one or more genes that have a polyA tail that is on average smaller or higher compared to an average length of polyA tails of genes from a sample from which the RNA expression data was obtained (or a reference sample).

In some embodiments, after expression data for at least one gene that introduces bias is removed from the gene expression data, the remaining gene expression data may be normalized again (“renormalized”) (e.g., to TPM or any other suitable unit such as reads per kilobase million (RPKM) or fragments per kilobase million (FPKM)) so that the normalized expression values are not biased by the expression data of the biasing gene(s), which was removed. In some embodiments, the remaining gene expression data may have expression data for at least 1,000 genes, at least 5,000 genes, at least 10,000 genes, between 500 and 5000 genes, between 1000 and 10,000 genes, between 5,000 and 15,000 genes or any suitable number of genes within these ranges.
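As a non-limiting illustration of removal followed by renormalization, the sketch below drops expression data for a set of biasing genes and rescales the remaining values so that they again sum to one million. The function name remove_and_renormalize and the expression values are hypothetical; the gene symbols are taken from the families listed above.

```python
from typing import Dict, Set


def remove_and_renormalize(
    gene_tpm: Dict[str, float],
    biasing_genes: Set[str],
) -> Dict[str, float]:
    """Drop biasing genes, then rescale the remainder to sum to 1e6."""
    kept = {g: v for g, v in gene_tpm.items() if g not in biasing_genes}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no expression left after removal")
    return {gene: value / total * 1e6 for gene, value in kept.items()}


# Illustrative values: a mitochondrial gene dominates the raw TPM values.
gene_tpm = {
    "MT-CO1": 600000.0,
    "HIST1H1A": 100000.0,
    "TP53": 100000.0,
    "GAPDH": 200000.0,
}
corrected = remove_and_renormalize(gene_tpm, {"MT-CO1", "HIST1H1A"})
```

In this example the ratio between the two retained genes is unchanged (1:2), but their values are no longer compressed by the expression attributed to the removed genes.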

Post-Sequencing Nucleic Acid Data Quality Control

As provided in the present disclosure, quality control is regularly performed during sample preparation processes. For example, the purity of the extracted nucleic acids or the size distribution of the DNA libraries is assessed. When one or more quality control issues occur and cannot be remedied in the laboratory, the provider (e.g., healthcare provider) of the biological sample is notified before proceeding to the subsequent steps. After the quality issues are resolved, the processes of sample preparation are completed and bioinformatics analysis (e.g., post-sequencing) is performed.

Aspects of methods and systems described herein provide for quality control to be performed on gene expression data to improve the accuracy and reliability of subsequent expression analysis (e.g., to determine a diagnosis, prognosis, and/or treatment for the patient or subject) and any resulting recommendation.

In some embodiments, bioinformatic quality control of sequence data can be conducted as a standalone process (e.g., based on nucleic acid data that is received from a healthcare provider) or in connection with a prior sample preparation process (e.g., if a patient sample is provided by the healthcare provider as opposed to nucleic acid sequence data). As illustrated in FIG. 7, act 301 to act 310 illustrate non-limiting sample preparation processes as described in the present disclosure, whereas act 311 to act 315 illustrate non-limiting quality control processes as described in the present disclosure. In some embodiments, one or more of act 301 to act 310 can be performed independently (e.g., without one or more of act 311 to act 315). In some instances, one or more of act 301 to act 310 can be skipped or delayed. Act 311 to act 315 can be performed independently (e.g., without act 301 to act 310). In some instances, one or more of act 311 to act 315 can be skipped or delayed. In some instances, one or more sample preparation (act 301 to act 310) and quality control (act 311 to act 315) processes can both be performed. In some instances, one or more of the sample preparation processes and one or more of the quality control processes can be performed.

In some embodiments, a process pipeline 300 is performed by obtaining a first tumor sample from a subject having, suspected of having, or at risk of having cancer at act 301, extracting RNA from the first sample of the first tumor at act 302, enriching the extracted RNA for coding RNA to obtain enriched RNA at act 303, preparing a first library of cDNA fragments from the enriched RNA for non-stranded RNA sequencing at act 304, obtaining RNA expression data for a subject having, suspected of having, or at risk of having cancer at act 305, aligning and annotating genes in the RNA expression data with known sequences of the human genome to obtain annotated RNA expression data at act 306, removing non-coding transcripts from the annotated RNA expression data at act 307, converting the annotated RNA expression data to gene expression data in transcripts per kilobase million (TPM) at act 308, identifying at least one gene that introduces bias in the gene expression data at act 309, removing at least one gene from the gene expression data to obtain bias-corrected gene expression data at act 310, obtaining sequence information and asserted information at act 311, determining one or more features from sequence information at act 312, determining whether one or more features match asserted information at act 313, making at least one additional determination of the features at act 314, and identifying a cancer treatment for the subject using the bias-corrected gene expression data at act 315.

In some embodiments, act 305 may comprise obtaining the RNA expression data by using a sequencing platform or by receiving it from a healthcare provider or laboratory. In some embodiments, act 306 may comprise converting the RNA expression data to gene expression data. As described herein, the “known sequence of the human genome” may refer to a reference. In some embodiments, act 307 may comprise converting the RNA expression data to gene expression data. In some embodiments, act 307 may comprise obtaining filtered RNA expression data. In some embodiments, act 308 may comprise normalizing the filtered RNA expression data to obtain gene expression data in transcripts per kilobase million (TPM). In some embodiments, the asserted information of act 311 may indicate an asserted source and/or an asserted integrity of the sequence data. In some embodiments, act 312 may comprise determining one or more disease features. In some embodiments, act 312 may comprise processing the sequence information or data to obtain determined information indicating a determined source and/or a determined integrity of the sequence information or data. In some embodiments, act 313 may comprise determining whether the determined information matches the asserted information. In some embodiments, the at least one additional determination of the features at act 314 may comprise determining disease features or features that are not directly related to diseases.

Aspects of methods and systems described herein provide an approach for validating nucleic acid sequence data by obtaining both the sequence data and asserted information related to one or more features of the sequence data (e.g., source, type of nucleic acid, expected integrity, etc.), determining one or more features from the sequence data, and verifying that the one or more features determined from the sequence data match the asserted information about those features. In some embodiments, the asserted information can be information about the patient, tissue type, tumor type, nucleic acid type (RNA, DNA, WES, polyA, etc.), sequencing protocol that was used, etc., or a combination thereof. In some embodiments, the asserted information can be an expected and/or acceptable (e.g., acceptable for a subsequent analysis of the sequence data) integrity threshold for the sequence information, including for example, an expected and/or acceptable level of GC content, contamination, coverage (e.g., genome, exome, exon, protein encoding, or other coverage) or other measure of integrity.
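As a non-limiting illustration of comparing determined features against asserted information, the sketch below computes one integrity feature (GC content) from sequence data and reports which asserted features are not matched. The feature names, the tolerance parameter, and the functions gc_content and find_mismatches are hypothetical; categorical features are compared by equality and numeric features within a tolerance, which is only one of many possible matching rules.

```python
from typing import Dict, List, Sequence, Union

Feature = Union[str, float]


def gc_content(sequences: Sequence[str]) -> float:
    """Fraction of G and C bases across all sequences."""
    total = sum(len(s) for s in sequences)
    if total == 0:
        return 0.0
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in sequences)
    return gc / total


def find_mismatches(
    asserted: Dict[str, Feature],
    determined: Dict[str, Feature],
    tolerance: float = 0.05,
) -> List[str]:
    """Return names of asserted features not matched by determined ones."""
    mismatches: List[str] = []
    for name, expected in asserted.items():
        observed = determined.get(name)
        if observed is None:
            mismatches.append(name)  # feature could not be determined
        elif isinstance(expected, (int, float)) and not isinstance(expected, bool):
            if abs(observed - expected) > tolerance:
                mismatches.append(name)
        elif observed != expected:
            mismatches.append(name)
    return mismatches


# Illustrative values only.
asserted = {"nucleic_acid_type": "polyA RNA", "tissue_type": "lung",
            "gc_content": 0.50}
determined = {"nucleic_acid_type": "polyA RNA", "tissue_type": "breast",
              "gc_content": gc_content(["GGCC", "ATAT", "GCAT"])}
mismatches = find_mismatches(asserted, determined)
```

A non-empty result could serve as a signal to verify or retest the sequence data before it is used in subsequent analysis.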

Nucleic acid sequencing, next generation sequencing (NGS) in particular, allows for the generation of large amounts of information for a given nucleic acid (DNA, RNA, genome, exome, transcriptome, etc.). However, because of the many different sequencing platforms that are available, the variety of sample preparation and sequencing protocols and techniques that are used, and the variability and inconsistency between platforms and protocols, there is substantial variability in the content and coverage of the resulting nucleic acid sequence information. Moreover, when evaluating sequence information from several sequencing runs, or large sets of sequence information from a plurality of sequencing runs (e.g., including, for example, historical data from different medical visits for one or more patients) or from different studies (e.g., from studies to create prognostic or diagnostic evaluations, or from studies to evaluate the effect of a drug or treatment on the progression of a disease, etc.), it can be challenging to combine sequence information from different sources. In addition, it can be challenging to detect incorrectly identified sequence data when large amounts of information are being combined from different sources.

Currently, no robust methods exist to validate (e.g., raise the confidence, reduce the uncertainty, correct for or omit low quality sequence information, provide a signal to verify or retest questionable sequence information or outliers, etc.) source and/or integrity (e.g., also as may be referred to herein as quality) of sequence information which may be the subject of further use (e.g., being used for analysis beyond the initial sequencing step), for example for diagnostic, prognostic, and/or clinical applications.

The disclosure recognizes the prevalence of next generation sequencing techniques and platforms employed across a variety of disciplines within the scientific community. The disclosure also recognizes the variety of protocols and methodologies associated with the different techniques and platforms employed. The variation in the platforms, and protocols to use the various platforms, creates variability within the data and sequence information realized from the use thereof, which presents a significant hurdle in using the sequence information for substantive analysis, especially if such sequence information is to be used for analysis beyond the initial data run by the original user of the sample (e.g., by a secondary user, beyond the user who procured and performed the initial sequencing, third parties to the sequencing, etc.).

Accordingly, the disclosure presents a variety of methods and processes to assess the quality of sequence information (e.g., for correct identification of the sequence information, sample identification, subject identification, etc.), as well as to assess the integrity of the sequence information (e.g., create checkpoints to screen for various integrity issues, for example, contamination or degradation). For example, in some embodiments, described herein are methods for evaluating sequence information by obtaining sequence information from a nucleic acid of a sample of a subject, obtaining asserted information, determining a feature (e.g., source, identity, status, characteristic) of the sequence information, and comparing the asserted information with the determined information. The sequence information may be obtained (e.g., acquired) from any source, or through any means known in the art. Accordingly, the sequence information may be generated using any suitable sequencing technology. Alternatively, the sequence information may be obtained electronically from a third party that generated the sequence information. In some embodiments, sequence information (e.g., reference sequence information) is obtained from an existing databank of sequences. In some embodiments, sequence information is obtained from a company, a non-profit organization, an academic institution, or a healthcare organization.

In some embodiments, a sample may be any specimen, biopsy, or biological component obtained (e.g., procured, taken, received) from a subject. For example, in some embodiments, the sample may be a blood sample, hair sample, tissue sample, bodily fluid sample, cell sample, blood component sample, or any other cell or tissue sample from which a nucleic acid may be obtained for sequencing.

In some embodiments, the subject may be any organism in need of treatment or diagnosis using methods or systems of the disclosure. For example, without limitation, subjects may include mammals and non-mammals. As used herein, a “mammal” refers to any animal constituting the class Mammalia (e.g., a human, mouse, rat, cat, dog, sheep, rabbit, horse, cow, goat, pig, guinea pig, hamster, chicken, turkey, or a non-human primate (e.g., Marmoset, Macaque)). In some embodiments, the mammal is a human. In some embodiments, the subject is a mammal. In some embodiments, the subject is a human.

In some embodiments, a sample may be a biological sample obtained from a subject, e.g., from a patient. In some embodiments, a sample may be blood, serum, sputum, urine, or a tissue biopsy (e.g., from any tissue, including but not limited to heart, liver, pancreas, CNS, gastrointestinal tract, mouth, colon, kidney, and skin). In some embodiments, a sample may be suspected to be a disease sample (e.g., a cancer sample). In some embodiments, a sample may be a healthy sample (e.g., to be used as a reference).

In some embodiments, sequence information is obtained from a next generation sequencing platform (e.g., Illumina™, Roche™, Ion Torrent™, etc.), or any high-throughput or massively parallel sequencing platform. In some embodiments, these methods may be automated; in some embodiments, there may be manual intervention. In some embodiments, the sequence information may be the result of non-next generation sequencing (e.g., Sanger sequencing). In some embodiments, the sample preparation may be according to manufacturer's protocols. In some embodiments, the sample preparation may follow custom-made protocols, or other protocols which are for research, diagnostic, prognostic, and/or clinical purposes. In some embodiments, the protocols may be experimental. In some embodiments, the origin or preparation method of the sequence information may be unknown.

In some embodiments, the size of the obtained RNA and/or DNA sequence data is at least 5 kilobases (kb). In some embodiments, the size of the obtained RNA and/or DNA sequence data is at least 10 kb. In some embodiments, the size of the obtained RNA and/or DNA sequence data is at least 100 kb. In some embodiments, the size of the obtained RNA and/or DNA sequence data is at least 500 kb. In some embodiments, the size of the obtained RNA and/or DNA sequence data is at least 1 megabase (Mb). In some embodiments, the size of the obtained RNA and/or DNA sequence data is at least 10 Mb. In some embodiments, the size of the obtained RNA and/or DNA sequence data is at least 100 Mb. In some embodiments, the size of the obtained RNA and/or DNA sequence data is at least 500 Mb. In some embodiments, the size of the obtained RNA and/or DNA sequence data is at least 1 gigabase (Gb). In some embodiments, the size of the obtained RNA and/or DNA sequence data is at least 10 Gb. In some embodiments, the size of the obtained RNA and/or DNA sequence data is at least 100 Gb. In some embodiments, the size of the obtained RNA and/or DNA sequence data is at least 500 Gb.

In some embodiments, the sequence information may be generated using a nucleic acid from a sample from a subject. In some embodiments, the sequence information may be sequence data indicating a nucleotide sequence of DNA and/or RNA from a previously obtained biological sample of a subject having, suspected of having, or at risk of having a disease. In some embodiments, the nucleic acid is deoxyribonucleic acid (DNA). In some embodiments, the nucleic acid is prepared such that the whole genome is present in the nucleic acid. In some embodiments, the nucleic acid is processed such that only the protein coding regions of the genome remain (e.g., exomes). When nucleic acids are prepared such that only the exomes are sequenced, it is referred to as whole exome sequencing (WES). A variety of methods are known in the art to isolate the exomes for sequencing, for example, solution-based isolation wherein tagged probes are used to hybridize to the targeted regions (e.g., exomes), which can then be further separated from the other regions (e.g., unbound oligonucleotides). These tagged fragments can then be prepared and sequenced.

In some embodiments, the nucleic acid is ribonucleic acid (RNA). In some embodiments, sequenced RNA comprises both coding and non-coding transcribed RNA found in a sample. When such RNA is used for sequencing, the sequencing is said to be generated from “total RNA” and also can be referred to as whole transcriptome sequencing. Alternatively, the nucleic acids can be prepared such that the coding RNA (e.g., mRNA) is isolated and used for sequencing. This can be done through any means known in the art, for example by isolating or screening the RNA for polyadenylated sequences. This is sometimes referred to as mRNA-Seq. Sequence information can include the sequence data generated by the nucleic acid sequencing protocol (e.g., the series of nucleotides in a nucleic acid molecule identified by next-generation sequencing, Sanger sequencing, etc.) as well as information contained therein (e.g., information indicative of source, tissue type, etc.), which may also be considered information that can be inferred or determined from the sequence data. For example, in some embodiments RNA sequence information may be analyzed to determine whether the nucleic acid was primarily polyadenylated or not.

Asserted information can refer to information about the sequence data, and by extension, the nucleic acid, the sample, and/or the subject from which the sequence data was obtained. In some embodiments, asserted information is provided along with the sequence data and can be verified by analyzing the sequence data as described herein. The asserted information may relate to a feature of the nucleic acid, the sample, or the subject, and can be used to evaluate the quality of the nucleic acid (e.g., the source or integrity of the nucleic acid). Asserted information can refer to an asserted source and/or an asserted integrity of the sequence data or information.

In some embodiments, a third party may provide sequence data as well as the related asserted information. In some embodiments, the asserted information is obtained from the same entity that the sequence data is obtained from. In some embodiments, the asserted information and sequence data are obtained from different parties. In some embodiments, the asserted information is obtained from a database. In some embodiments, the asserted information is a reference value or property. In some embodiments, the asserted information may allege an identity of the sequence information, an identity of the nucleic acid of the sequence information, an identity of the sample from which the sequence information was generated, or an identity of the subject from which the sample was obtained. In some embodiments, the asserted information may identify the sequence data as obtained from polyadenylated RNA, as originating from whole transcriptome sequencing, or as being from WES. In some embodiments, the asserted information may identify a cell or tissue type for the sample from which the nucleic acid was obtained. In some embodiments, the asserted information may allege a tumor type for the sample from which the nucleic acid was obtained. In some embodiments, the asserted information may identify an MHC profile (e.g., sequences for alleles of the MHCs of the subject from which the nucleic acid was obtained) for the subject from which the sample was obtained. In some embodiments, the asserted information may identify an expected protein subunit ratio for the sample. In some embodiments, the asserted information may provide an expected complexity value for the sequence information. In some embodiments, the asserted information may provide an expected contamination value for the sequence information. In some embodiments, the asserted information may provide an expected coverage value for the sequence information.
In some embodiments, the asserted information may provide an expected exon coverage value for the sequence information. In some embodiments, the asserted information may provide an expected read composition value for the sequence information. In some embodiments, the asserted information may provide an expected Phred score for the sequence information. In some embodiments, the asserted information may provide an expected single nucleotide polymorphism (SNP) value for the sequence information. In some embodiments, the asserted information may relate to a GC content value for the sequence information. In some embodiments, the asserted information may comprise additional information. In some embodiments, the asserted information may comprise information relating to multiple or more than one feature of the sequence information. In some embodiments, the asserted information is any combination of the aforementioned features (e.g., determined values, properties, characteristics, etc.).

As used herein, a “feature” may be a property or characteristic that is determined from analysis of the sequence information and that provides the user with information about the sequence information, the sample from which it was taken, and/or the subject from which the sample was taken, beyond the sequence of the nucleotides of the sequence information. The sequence information may be in connection with the gene expression data obtained from a healthcare provider or a laboratory. For example, a feature may be indicative of a source (e.g., patient, subject, nucleic acid type), patient or subject identity, tissue type, tumor type, polyadenylation status, MHC sequence, protein subunits, complexity, contamination, coverage (e.g., total sequence, exon, etc.), read composition, quality and/or Phred score, single nucleotide polymorphism (SNP) positions, and/or GC content. The feature(s) of the sequence information can then be indicative of whether the sequence information is potentially a match or mismatch to the asserted information from a healthcare provider or a laboratory.

When considering identity or source, it is important to recognize that the term not only refers to identifying a specific subject or patient as a particular individual, but also that one or more features of sequence information for one sample can be identified as being the same as the one or more features of sequence information obtained from another sample. For example, sequence information A may be compared to sequence information B which is presented and asserted to be from the same nucleic acid, subject or patient, tissue, or tumor. The identity can be corroborated or questioned by the methods herein without knowing the actual identity of the subject but can support the finding that the identity is consistent with another given sequence information. In some embodiments, the identity of the sequence information is used to compare to asserted information for a given sample, subject, tissue, or tumor. In some embodiments the identity of the sequence information is used to compare the asserted information for another nucleic acid or reference value.

In some embodiments, these determined features of the sequence information are then evaluated (e.g., determined, matched, aligned, measured, assessed) against the asserted information. This evaluation can be done to increase the confidence that the sequence information is of a particular origin (e.g., source), is identified correctly, or has a particular or specific characteristic (e.g., is from polyadenylated nucleic acids). In this respect, the methods can be used to provide checkpoints and measures to highlight any potential problems (e.g., non-matching values (e.g., for determined features and asserted information), or determined values which fall outside of accepted or established ranges). Such problems can signify or signal problems with integrity (e.g., degraded or contaminated) or source (e.g., misidentified, wrongly labeled, etc.) of the sequence data. Using the methods and processes herein, and matching determined features to asserted information for given sequence information, lessens the possibility of incorrect or poor quality sequence data being used for analysis and raises the confidence that the sequence data is of sufficient quality to be used for diagnostic, prognostic, and/or clinical analyses.

In some embodiments, evaluating whether determined information matches asserted information involves determining whether the determined information matches the asserted information exactly or is within a specified threshold. More generally, in some embodiments, evaluating whether two values “match” may involve determining whether the two values match exactly or are within a specified threshold. That threshold may be 0, requiring exact matches in some embodiments. That threshold may be greater than 0 such that when numerical values are being compared, the numerical values may be said to “match” if their values are within the threshold of one another (e.g., when the absolute difference of the numerical values is less than or equal to the threshold value). In some embodiments, the threshold may be set as a function of a standard deviation (or a multiple thereof), a quantile, a percentile, or any other suitable statistical quantity. In some embodiments, evaluating whether two values “match” may involve determining, when there is a difference between the two values, whether that difference is statistically significant. Such a determination may be performed using a statistical hypothesis test, a threshold, or any other suitable statistical or mathematical technique, as aspects of the technology described herein are not limited in this respect.
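As a minimal sketch of the threshold-based notion of a “match” described above, the following Python example compares a determined value with an asserted value using an absolute-difference threshold. The function names `values_match` and `sd_threshold`, and the choice of two standard deviations as a default multiple, are illustrative assumptions and are not specified by the disclosure.

```python
import statistics
from typing import Sequence


def values_match(determined: float, asserted: float, threshold: float = 0.0) -> bool:
    """Return True when the absolute difference is <= threshold.

    A threshold of 0 requires an exact match.
    """
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    return abs(determined - asserted) <= threshold


def sd_threshold(reference_values: Sequence[float], k: float = 2.0) -> float:
    """Hypothetical helper: a threshold equal to k sample standard deviations."""
    return k * statistics.stdev(reference_values)


# Example: an asserted coverage of 150 compared with a determined value of 152,
# using a threshold derived from illustrative reference values.
limit = sd_threshold([148.0, 150.0, 152.0])
is_match = values_match(152.0, 150.0, limit)
```

A statistical hypothesis test could be substituted for the fixed threshold where a significance-based comparison is preferred.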

In some embodiments, one or more quality control parameters are checked for the bioinformatics data. In some embodiments, tumor purity can be checked. Tumor purity, as described herein, may refer to the proportion of cancer cells in the admixture. In some embodiments, the target tumor purity for the WES is ≥20% (e.g., 20%, 40%, 60%). In some embodiments, the target tumor purity for the RNA-seq is ≥20% (e.g., 20%, 40%, 60%). In some embodiments, depth of coverage can be checked. In some embodiments, the depth of coverage for the WES is ≥150× average coverage of tumor sample (e.g., 150×, 180×, 200×). In some embodiments, the target depth of coverage for the RNA-seq is ≥100× (e.g., 100×, 150×, 200×).

In some embodiments, alignment rate can be checked. In some embodiments, the target alignment rate for the WES is more than 90% (e.g., 91%, 95%, 99%). In some embodiments, the target alignment rate for the RNA-seq is more than 90% (e.g., 91%, 95%, 99%).

In some embodiments, base call quality scores such as Phred score can be checked. In some embodiments, the target Phred score for the WES is more than 30 (e.g., 35, 40, 50). In some embodiments, the target Phred score for the RNA-seq is more than 30 (e.g., 35, 40, 50). In some embodiments, uniformity of coverage can be checked. In some embodiments, the target uniformity of coverage for the WES is 85% of base pairs in target regions covered ≥20× for tumor tissue (e.g., 85%, 95%, 99%). In some embodiments, the target uniformity of coverage for the WES is 85% of base pairs in target regions covered ≥20× for normal tissue (e.g., 85%, 95%, 99%). In some embodiments, target regions for determining uniformity of coverage may be ExonV7 target regions with the use of the coding regions from CCDS (consensus coding sequence) genes.
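The uniformity-of-coverage target above can be illustrated with a short Python sketch that computes the fraction of target-region positions covered at or above a minimum depth. The per-base depth list is assumed to have been extracted beforehand by a separate tool; the function names and the toy depth values are hypothetical.

```python
from typing import Sequence


def fraction_covered(depths: Sequence[int], min_depth: int = 20) -> float:
    """Fraction of target positions whose read depth is >= min_depth."""
    if not depths:
        raise ValueError("depths must not be empty")
    return sum(1 for d in depths if d >= min_depth) / len(depths)


def meets_uniformity_target(depths: Sequence[int],
                            min_depth: int = 20,
                            target_fraction: float = 0.85) -> bool:
    """True when at least target_fraction of positions reach min_depth."""
    return fraction_covered(depths, min_depth) >= target_fraction


# Toy per-base depths for ten target positions (illustrative values only).
tumor_depths = [25, 25, 25, 25, 25, 25, 25, 25, 25, 5]
tumor_ok = meets_uniformity_target(tumor_depths)
```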

In some embodiments, GC bias can be checked. In some embodiments, the target GC bias for the WES is at least 50 (e.g., 50, 60, 70). In some embodiments, the acceptable range of GC bias for the WES is at least 45-65 (e.g., 45-65, 50-65, 55-65). In some embodiments, the target GC bias for the RNA-seq is at least 50 (e.g., 50, 60, 70). In some embodiments, the acceptable range of GC bias for the RNA-seq is at least 45-65 (e.g., 45-65, 50-65, 55-65).
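The text does not state how the GC bias value is computed or in what units it is expressed. Purely as an illustration, and under the assumption that the values above are read GC content expressed as a percentage, the following Python sketch computes GC content over a set of reads and checks it against the 45-65 range. The function names are hypothetical.

```python
from typing import Iterable


def gc_percent(sequences: Iterable[str]) -> float:
    """Percentage of G and C among unambiguous (A, C, G, T) bases."""
    gc = 0
    total = 0
    for seq in sequences:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        total += sum(s.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("no unambiguous bases found")
    return 100.0 * gc / total


def within_range(value: float, low: float = 45.0, high: float = 65.0) -> bool:
    """Assumed reading of the acceptable range: low <= value <= high."""
    return low <= value <= high


reads = ["GGCC", "AATT"]          # toy reads, illustrative only
gc_value = gc_percent(reads)
gc_ok = within_range(gc_value)
```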

In some embodiments, mapping quality can be checked. In some embodiments, the mapping quality for the WES is ≥10 (e.g., 10, 20, 30).

In some embodiments, duplication rate can be checked. In some embodiments, the duplication rate for the WES is less than 30% (e.g., 29.9%, 25%, 15%). In some embodiments, the duplication rate for the RNA-seq is less than 85% (e.g., 84.99%, 80%, 70%).

In some embodiments, insert size can be checked. In some embodiments, the acceptable median insert size for tumor tissue for the WES is about 150 (e.g., 150, 280, 250). In some embodiments, the target median insert size for tumor tissue for the WES is about 200 (e.g., 200, 250, 350). In some embodiments, the acceptable median insert size for normal tissue for the WES is about 150 (e.g., 150, 280, 250). In some embodiments, the target median insert size for normal tissue for the WES is about 200 (e.g., 200, 250, 350). In some embodiments, the acceptable median insert size for tumor tissue for the RNA-seq is about 150 (e.g., 150, 280, 250). In some embodiments, the target median insert size for tumor tissue for the RNA-seq is about 200 (e.g., 200, 250, 350).

In some embodiments, contamination can be checked. In some embodiments, contamination acceptable for the WES is less than 0.05% (e.g., 0.04%, 0.03%, 0.01%). In some embodiments, contamination acceptable for the RNA-seq is less than 0.05% (e.g., 0.04%, 0.03%, 0.01%).

In some embodiments, SNP concordance of a pair of tumor versus normal samples from the same patient can be checked. In some embodiments, the target SNP concordance for the WES is more than 90% (e.g., 91%, 95%, 98%). In some embodiments, the acceptable SNP concordance for the WES is more than 85% (e.g., 86%, 90%, 98%). In some embodiments, the target SNP concordance for the RNA-seq is more than 90% (e.g., 91%, 95%, 98%). In some embodiments, the acceptable SNP concordance for the RNA-seq is more than 85% (e.g., 86%, 90%, 98%).

In some embodiments, HLA allele concordance of a pair of tumor versus normal samples from the same patient can be checked. In some embodiments, the threshold for normal versus tumor tissue for the WES is less than 5 (e.g., 4.5, 3, 2.5). In some embodiments, the threshold for tumor RNA-seq tissue versus normal WES tissue for the RNA-seq is less than 5 (e.g., 4.5, 3, 2.5).

In some embodiments, sequence information can be assessed for genome contamination (e.g., non-human genome contamination). In some embodiments, the samples or sequence information are assessed to determine whether they are contaminated by determining whether they contain sequences from other species or reference genomes such as mouse, zebrafish, Drosophila, C. elegans, Saccharomyces, Arabidopsis, microbiome, mycoplasma, adapters, UniVec, and phiX rRNA. In some embodiments, the target threshold for the ADA genomes contamination for the WES is more than 60 (e.g., 65, 70, 80). In some embodiments, the acceptable threshold for the ADA genomes contamination for the WES is more than 40 (e.g., 45, 60, 80). In some embodiments, the target threshold for the ADA genomes contamination for the RNA-seq is more than 40 (e.g., 50, 60, 80). In some embodiments, the acceptable threshold for the ADA genomes contamination for the RNA-seq is more than 20 (e.g., 30, 50, 70).
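Several of the WES quality control parameters described above can be collected into a single table of checks, as in the following Python sketch. The limits are the WES values quoted in the preceding paragraphs; the metric names, the `run_qc` function, and the sample metric values are hypothetical and serve only as an illustration of how such checks might be organized.

```python
import operator

# Hypothetical metric names; the limits are the WES values quoted in the text.
WES_CHECKS = {
    "tumor_purity_pct": (operator.ge, 20.0),      # >= 20%
    "mean_coverage_x": (operator.ge, 150.0),      # >= 150x
    "alignment_rate_pct": (operator.gt, 90.0),    # more than 90%
    "phred_score": (operator.gt, 30.0),           # more than 30
    "mapping_quality": (operator.ge, 10.0),       # >= 10
    "duplication_rate_pct": (operator.lt, 30.0),  # less than 30%
    "contamination_pct": (operator.lt, 0.05),     # less than 0.05%
}


def run_qc(metrics, checks=WES_CHECKS):
    """Return a per-parameter report of 'pass', 'fail', or 'missing'."""
    report = {}
    for name, (compare, limit) in checks.items():
        value = metrics.get(name)
        if value is None:
            report[name] = "missing"
        else:
            report[name] = "pass" if compare(value, limit) else "fail"
    return report


# Illustrative metrics for one sample; contamination was not measured here.
sample_metrics = {
    "tumor_purity_pct": 40.0,
    "mean_coverage_x": 180.0,
    "alignment_rate_pct": 95.0,
    "phred_score": 35.0,
    "mapping_quality": 20.0,
    "duplication_rate_pct": 42.0,
}
qc_report = run_qc(sample_metrics)
```

Keeping the limits in a table rather than in branching logic allows a separate table with the RNA-seq limits to be passed through the `checks` argument.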

In some embodiments, only one feature is evaluated against asserted information. In some embodiments, more than one feature is evaluated against asserted information. In some embodiments, at least two or more features are evaluated against asserted information. In some embodiments, at least three or more features are evaluated against asserted information. In some embodiments, at least four or more features are evaluated against asserted information. In some embodiments, at least five or more features are evaluated against asserted information. In some embodiments, at least six or more features are evaluated against asserted information. In some embodiments, at least seven or more features are evaluated against asserted information. In some embodiments, at least eight or more features are evaluated against asserted information. In some embodiments, at least nine or more features are evaluated against asserted information. In some embodiments, at least ten or more features are evaluated against asserted information. In some embodiments, at least eleven or more features are evaluated against asserted information. In some embodiments, at least twelve or more features are evaluated against asserted information. In some embodiments, at least thirteen or more features are evaluated against asserted information. In some embodiments, at least fourteen or more features are evaluated against asserted information. In some embodiments, at least fifteen or more features are evaluated against asserted information.

In some embodiments, if the features or determined values are found to not meet or match the asserted information, additional steps are performed. In some embodiments, if the features or determined values are found to not meet or match the asserted information, the sequence information is rejected (e.g., is not used for subsequent analysis). In some embodiments, if the features or determined values are found to not meet or match the asserted information, the sequence information is retested, meaning that any evaluation of features or determinations are performed for at least another or second, or more (e.g., third, fourth, fifth, sixth, etc.) time. In some embodiments, if the features or determined values are found to not meet or match the asserted information, another or second, or more (e.g., third, fourth, fifth, sixth, etc.) sequence information is obtained and then tested, meaning that any evaluation of features or determinations are performed for at least one time, or second, or more (e.g., third, fourth, fifth, sixth, etc.) time, independent of the initial determinations and evaluations done on the first sequence information. In some embodiments, if the features or determined values are found to not meet or match the asserted information, the sequence information is reported to a user as such. In some embodiments, any combination of these steps may be performed in the event that features or determined values are found to not meet or match the asserted information. In some embodiments, if the features or determined values are found to not meet or match the asserted information, the sequence information can still be evaluated for characteristics related to disease (e.g., cancer), but information about the quality (e.g., the extent and nature of the one or more features of the determined sequence information that do not match the asserted information) can be provided to a user (e.g., to a physician or other medical practitioner).
In some embodiments, the characteristics relate to the type of cancer, its environment, its stage, its location, its tissue of origin, its statistical likelihood of responding to various treatments or therapies, or other properties which may aid a practitioner in treating the subject. In some embodiments, if the features or determined values are found to meet or match the asserted information (e.g., match, exceed, or otherwise satisfy reference or threshold values), then additional steps may be performed. In some embodiments, if the features or determined values are found to meet or match the asserted information (e.g., match, exceed, or otherwise satisfy reference or threshold values), the sequence information is evaluated for characteristics related to cancer. In some embodiments, the characteristics relate to the type of cancer, its environment, its stage, its location, its tissue of origin, its statistical likelihood of responding to various treatments or therapies, or other properties which may aid a practitioner in treating the subject.

In some embodiments, after one or more quality control steps are performed, a report is generated for the user with the results of the quality control steps that were performed.

Accordingly, in one aspect, the disclosure relates to a method of evaluating sequence information of at least one nucleic acid, to determine at least one feature thereof. The at least one feature can be used to evaluate the quality or integrity of the sequence information, to interrogate the source of the sequence information, or to allow for analyses of other sequence information, which may or may not be from the same sequencing platform, or from the same or a different sample preparation protocol. Further, the at least one feature may be used as a quality control measure to ensure that subsequent analyses use sequence information of a threshold quality and that lower quality sequence information is omitted.

Accordingly, in one aspect, the disclosure relates to a method of evaluating sequence information by (a) obtaining sequence information which comprises: (1) sequence data from a first ribonucleic acid (RNA); or (2) sequence data from a first whole exome sequence (WES); and (b) determining one or more features of the sequence data selected from the group consisting of: (i) the identity of the subject from which the nucleic acid was obtained; (ii) a tissue of origin from which the nucleic acid was obtained; (iii) a tumor type from which the nucleic acid was obtained; (iv) a quality measure of the first RNA sequence data; (v) whether the RNA sequence data was obtained from polyadenylated (polyA) RNA or total RNA; (vi) if the first sequence data set is first WES sequence data, (vii) the sequencing platform that was used to generate the first sequence data set; and (viii) a quality measure of the first sequence data set.

In some embodiments, the method further comprises obtaining additional sequence information if the one or more features of the sequence information is below a quality control threshold suitable for further analysis.

In some embodiments, the evaluated feature is the subject identity. In some embodiments, the subject identity is determined by performing one or more evaluations from the group comprising: a major histocompatibility complex evaluation and a SNP concordance evaluation, wherein the results of the evaluations are compared to an asserted value for the subject or a second sequence data set from the subject.

In some embodiments, the evaluated feature is the tissue of origin. In some embodiments, the tissue of origin is determined by performing one or more evaluations from the group comprising: protein expression and biomarker analysis. In another aspect, the disclosure relates to a method of evaluating a feature comprising assigning a tissue of origin of the sample which generated the sequence information. In some embodiments, the method comprises evaluating the sequence information for markers or gene expression indicative of a tissue type from which the sequence information originated. In some embodiments, the method comprises evaluating the markers or gene expression against a database of the same for different tissue types. Different tissues throughout a subject express different proteins which create a profile of such a tissue. Accordingly, it is possible to evaluate the protein expression profile and match it to a tissue type to identify the tissue from which the sample, and by extension the sequence information, was obtained. This can be done through a variety of methods known in the art. For example, the number of a given messenger RNA (mRNA) transcript (e.g., used as a proxy for evaluating protein expression) can be evaluated against a database of known tissue markers (e.g., protein expression profiles), against a provided set of markers for a subject, or against a second sequence information or set of tissue markers obtained from the subject. In some embodiments, a tissue of origin is determined by evaluating the sequence information for markers (e.g., protein expression) and matching the markers with a database of tissues. In some embodiments, a tissue of origin is determined by evaluating the sequence information for markers (e.g., protein expression) and matching the markers with a set of markers from a tissue of a subject.
In some embodiments, a tissue of origin is determined by evaluating the sequence information for markers (e.g., protein expression) and matching the markers with a second sequence information obtained from a subject where the tissue of origin is known.

In some embodiments, the evaluated feature is a measure of the integrity of the sequence information. In some embodiments, the integrity measure of the first RNA sequence data is determined by performing one or more evaluations from the group comprising: determining coverage of one or more genes in the RNA sequence data, determining relative coverage of two or more exons for at least one gene in the RNA sequence data, determining an expression ratio of two known reference genes from the RNA sequence data, or other feature or combinations of two or more thereof. In some embodiments, the integrity measure of the DNA sequence data is determined by performing one or more evaluations from the group comprising: total coverage and/or chromosomal coverage of the DNA sequence data, or other feature or combinations of two or more thereof.

In some embodiments, the RNA sequence data is analyzed to determine whether it was obtained from polyA RNA or total RNA. In some embodiments, the RNA sequence data is analyzed by evaluating an expression level of one or more mitochondrial or histone genes from the RNA sequence data, and/or other features that are characteristic of polyA or total RNA. In some embodiments, the feature being evaluated is the sequencing platform that was used to generate the sequence. In some embodiments, the sequencing platform used for generating the WES sequence data is determined by performing one or more evaluations from the group comprising: determining % variance for one or more reference genes in the WES sequence data, or other property of sequencing data that is characteristic of the sequencing platform that was used to generate the sequence data.

In some embodiments, methods comprise evaluating at least one of the features described herein. In some embodiments, a method comprises evaluating at least two of the features described herein. In some embodiments, a method comprises evaluating at least three of the features described herein. In some embodiments, a method comprises evaluating at least four of the features described herein. In some embodiments, a method comprises evaluating at least five of the features described herein. In some embodiments, a method comprises evaluating at least six of the features described herein. In some embodiments, a method comprises evaluating at least seven of the features described herein.

In some embodiments, the quality (e.g., source or integrity) of sequence information from one or more nucleic acid samples (e.g., at least two nucleic acid samples) is evaluated by (a) determining the sequence of two or more (e.g., 2, 3, 4, 5, 6 or more) major histocompatibility complexes (MHCs), and (b) determining whether the MHCs from the one or more samples match. In some embodiments, if the MHCs don't match (e.g., if a calculated agreement value is less than a statistically significant threshold) the sequence information from each of the nucleic acids is deemed likely from distinct sources, of insufficient quality, and is removed, discarded, retested, and/or reported as such to a user. In some embodiments, if the calculated agreement value (x) between WES normal/tumor/RNAseq is 0<x≤2 (e.g., 1, 1.5, 2), it is deemed acceptable with a “warning.” Warning means that the calculated agreement value is within a range that is deemed acceptable but is considered close to being not acceptable. In some embodiments, if the calculated agreement value (x) between WES normal/tumor/RNAseq is >5, it represents not acceptable or bad quality. In some embodiments, if the calculated agreement value (x) between WES normal/tumor/RNAseq is 0, it represents good quality. In some embodiments, if the MHCs match (e.g., if the agreement value is at or above the statistically significant threshold) the sequence information from each of the nucleic acid samples is deemed sufficiently likely from the same source, of sufficient quality, and is retained for further analysis and/or reported as such to a user.
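The agreement-value ranges recited above can be expressed as a small classification function. The following Python sketch is illustrative only: the function name is hypothetical, the computation of the agreement value itself is not shown, and values in the interval 2<x≤5 are labeled “unspecified” because the ranges above do not assign them a category.

```python
def classify_mhc_agreement(x: float) -> str:
    """Map a calculated agreement value x to the categories recited in the text."""
    if x < 0:
        raise ValueError("agreement value must be non-negative")
    if x == 0:
        return "good"
    if x <= 2:
        return "warning"          # acceptable, but close to not acceptable
    if x > 5:
        return "not acceptable"
    # 2 < x <= 5 is not assigned a category by the ranges in the text.
    return "unspecified"


labels = [classify_mhc_agreement(v) for v in (0, 1.5, 3, 6)]
```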

In some embodiments, the quality of sequence information from one or more (e.g., at least two) nucleic acid samples is evaluated by determining a concordance value for single nucleotide polymorphisms (SNPs) in the sequence information. In some embodiments, the method further comprises evaluating the concordance value. In some embodiments, if the concordance value is less than 85%, less than 80%, or less than 75%, the sequence information from each of the nucleic acid samples is deemed likely from distinct sources, of insufficient quality, and is removed, discarded, retested, and/or reported as such to a user. In some embodiments, if the concordance value is less than 75%, the sequence information is deemed not acceptable. In some embodiments, if the concordance value is more than 80% and less than 95%, the sequence information is deemed to be within a range that is close to being not acceptable. In some embodiments, if the concordance value is more than 95%, the sequence information is deemed acceptable. In some embodiments, if the concordance value is at least 75%, at least 80%, or at least 85%, the sequence information from each of the nucleic acid samples is deemed sufficiently likely from the same source, of sufficient quality, and is retained, and/or reported as such to a user. In some embodiments, at least 5,000 SNPs can be evaluated for concordance values. In some embodiments, at least 6,000 SNPs can be evaluated for concordance values. In some embodiments, at least 7,000 SNPs can be evaluated for concordance values. In some embodiments, at least 8,000 SNPs can be evaluated for concordance values.
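One simple way to obtain a concordance value is to compare genotype calls at the SNP sites shared by two samples, as in the following Python sketch. This is an assumption about how concordance might be computed and is not taken from the disclosure; the data layout (a mapping from a site to a genotype string), the function names, and the toy genotypes are hypothetical. The default cutoff of 85% is one of the values recited above.

```python
def snp_concordance(genotypes_a: dict, genotypes_b: dict) -> float:
    """Percentage of shared SNP sites at which the two genotype calls agree."""
    shared = genotypes_a.keys() & genotypes_b.keys()
    if not shared:
        raise ValueError("the samples share no SNP sites")
    matches = sum(1 for site in shared if genotypes_a[site] == genotypes_b[site])
    return 100.0 * matches / len(shared)


def same_source(concordance_pct: float, cutoff_pct: float = 85.0) -> bool:
    """True when concordance is at least the chosen cutoff."""
    return concordance_pct >= cutoff_pct


# Toy genotype calls keyed by (chromosome, position); illustrative only.
tumor = {("chr1", 100): "AG", ("chr1", 200): "CC", ("chr2", 50): "TT", ("chr3", 7): "GG"}
normal = {("chr1", 100): "AG", ("chr1", 200): "CC", ("chr2", 50): "CT",
          ("chr3", 7): "GG", ("chr4", 9): "AA"}
concordance = snp_concordance(tumor, normal)
```

In practice several thousand sites would be compared, consistent with the SNP counts recited above; four sites are used here only to keep the example readable.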

In some embodiments, the quality of sequence information from one or more (e.g., at least two) nucleic acid samples is evaluated by determining a contamination value for the sequence information. In some embodiments, if the contamination value is above a statistically significant threshold, the sequence information is removed, discarded, retested, and/or reported as such to a user. In some embodiments, if the contamination value is more than 0.05% (e.g., 0.06%, 1%, 2%), the sequence information is deemed to be close to not acceptable (e.g., warning). In some embodiments, if the contamination value is more than 0.1% (e.g., 0.1%, 0.5%, 1%), the sequence information is deemed to be not acceptable for blood sample and fresh frozen tissue. In some embodiments, if the contamination value is below the threshold, the sequence information is retained, and/or reported as such to a user.
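The contamination thresholds above can be sketched as a two-level classification. In the following Python example the function name is hypothetical, and the handling of values exactly equal to a threshold is an assumption (strict “more than” comparisons are used, following the wording above).

```python
def classify_contamination(pct: float,
                           warn_pct: float = 0.05,
                           reject_pct: float = 0.1) -> str:
    """Classify a contamination value given in percent."""
    if pct < 0:
        raise ValueError("contamination must be non-negative")
    if pct > reject_pct:
        return "not acceptable"
    if pct > warn_pct:
        return "warning"
    return "acceptable"


contamination_labels = [classify_contamination(v) for v in (0.01, 0.06, 0.5)]
```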

In some embodiments, the quality of sequence information from one or more (e.g., at least two) nucleic acid samples is evaluated by analyzing sequence information from the one or more nucleic acid samples against a set of tumor types, determining predicted tumor type(s) from the sequence information, and determining whether the predicted tumor type(s) matches the tumor type(s) that were provided (e.g., asserted) for the one or more nucleic acid samples. In some embodiments, determining predicted tumor type(s) as a quality control step can be performed by using a computerized system or process as described herein. In some embodiments, determining predicted tumor type(s) as a quality control step can be performed by determining a cancer grade from sequence data by using machine learning techniques, as described herein and in the U.S. Provisional Patent Application Ser. No. 62/943,976, titled “Machine Learning Techniques for Gene Expression Analysis,” filed on Dec. 5, 2019, which is incorporated by reference herein in its entirety. In some embodiments, if there is disagreement between the tumor type(s) (e.g., cancer grades) obtained from the sequence evaluation and asserted information, the sequence information is identified as suspect, or of insufficient quality, and is removed, discarded, retested, and/or reported as such to a user. In some embodiments, if there is agreement between the predicted tumor type(s) and expected tumor type(s) for the one or more nucleic acid samples, the sequence information is deemed of sufficient quality, and is retained, and/or reported as such to a user.

In some embodiments, matching the predicted tumor type(s) to the tumortype(s) that were provided comprise using a set of reference genes froma training data set containing a plurality of signature genes that areup-regulated or down-regulated in a certain tumor type, relative to anormal, healthy sample. For instance, if the predicted tumor type isprostate cancer (e.g., asserted information), the sample will be checkedagainst the known reference genes of prostate cancer. In someembodiments, the predicted tumor type is also evaluated against itstumor grade, which can help determine the signature genes of theasserted cancer grade at different stages of cancer.

As described above, in some embodiments, determining predicted tumortype(s) as a quality control step may be performed by determining acancer grade from sequence information by using a machine learningapproach employing a statistical model trained using training data.

For example, in some embodiments, a statistical model may be used topredict characteristic(s) of a biological sample, using gene expressiondata, based on an input ranking of genes, ranked based on theirrespective expression levels, for a sequencing platform. Using the inputranking(s), instead of the specific values for the expression levels,allows for the same or similar data processing pipeline to be usedacross different expression data regardless of the specific manner inwhich the expression levels were obtained (e.g., regardless of whichsequencing platform, sequencing conditions, sample preparation, dataprocessing to obtain expression levels, etc.). In some embodiments, astatistical model may be used to predict cancer grade of the biologicalsample. In some embodiments, a statistical model may be used to predicttissue of origin of the biological sample, which also may be used forperforming quality control as described herein.

For example, in some embodiments, rankings of genes based on the geneexpression levels (in a biological sample) as determined by a sequencingplatform may be provided as input to a statistical model trained topredict tissue of origin for the biological sample. The predicted tissueof origin may be compared against asserted tissue of origin as part ofthe quality control techniques described herein. As another example, insome embodiments, rankings of genes based on the gene expression levels(in a biological sample) as determined by a sequencing platform may beprovided as input to a statistical model trained to predict cancer gradefor the biological sample. The predicted cancer grade may be comparedagainst asserted cancer grade as part of the quality control techniquesdescribed herein.
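The use of gene rankings as model input may be illustrated by the following minimal sketch, which orders a gene set by expression level and shows that the ordering is unchanged when the same sample is measured on a different scale. The function name, the tie-breaking rule, and the expression values are hypothetical; the statistical model itself is not shown.

```python
def rank_genes(expression, gene_set):
    """Return the genes of gene_set ordered from highest to lowest expression.

    Genes missing from the expression data are skipped; ties are broken
    alphabetically (an arbitrary choice made for this sketch).
    """
    present = [gene for gene in gene_set if gene in expression]
    return sorted(present, key=lambda gene: (-expression[gene], gene))


if __name__ == "__main__":
    gene_set = ["MKI67", "TOP2A", "BCL2"]
    # Hypothetical expression levels for one sample.
    platform_a = {"MKI67": 50.0, "TOP2A": 120.0, "BCL2": 5.0}
    # The same sample on a different, monotonically related scale.
    platform_b = {gene: 37.0 * level for gene, level in platform_a.items()}
    print(rank_genes(platform_a, gene_set))
    print(rank_genes(platform_a, gene_set) == rank_genes(platform_b, gene_set))
```

Because only the order of the genes is passed on, the downstream model does not depend on the units in which the expression levels were reported.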

In some embodiments, the set of genes being ranked depends on the particular biological characteristic of interest. For example, one set of genes may be used for determining the tissue of origin and another set of genes may be used for determining the cancer grade.

In some embodiments, the expression data may be obtained for cells in the biological sample, where the subject has, is suspected of having or is at risk of having cancer. In the context where tissue of origin is a characteristic being determined, the tissue of origin is for the cells in the biological sample. The tissue of origin may refer to a particular tissue type from which the cells originate, such as lung, pancreas, stomach, colon, liver, bladder, kidney, thyroid, lymph nodes, adrenal gland, skin, breast, ovary, and prostate.

For example, some embodiments involve using a gene set for predicting tissue of origin, which may include cell of origin, for Diffuse Large B-Cell Lymphoma (DLBCL), such as germinal center B-cell (GCB) and activated B-cell (ABC). Genes in the gene set may be selected from the group consisting of: ITPKB, MYBL1, LMO2, BATF, IRF4, LRMP, CCND2, SLA, SP140, PIM1, CSTB, BCL2, TCF4, P2RX5, SPINK2, VCL, PTPN1, REL, FUT8, RPL21, PRKCB1, CSNK1E, GPR18, IGHM, ACP1, SPIB, HLA-DQA1, KRT8, FAM3C, and HLA-DMB.

In the context where cancer grade is a characteristic being determined, the cancer grade is for the cells in the biological sample. The cancer grade may refer to proliferation and differentiation characteristics of the cells in the biological sample and refer to a numerical grade that is generally determined by visual observation of cells using microscopy, such as Grade 1, Grade 2, Grade 3, and Grade 4.

For example, some embodiments involve using a gene set for predicting breast cancer grade. Genes in the gene set may be selected from the group consisting of: UBE2C, MYBL2, PRAME, LMNB1, CXCL9, KPNA2, TPX2, PLCH1, CCL18, CDK1, MELK, CCNB2, RRM2, CCNB1, NUSAP1, SLC7A5, TYMS, GZMK, SQLE, C1orf106, CDC25B, ATAD2, QPRT, CCNA2, NEK2, IDO1, NDC80, ZWINT, ABCA12, TOP2A, TDO2, S100A8, LAMP3, MMP1, GZMB, BIRC5, TRIP13, RACGAP1, ASPM, ESRP1, MAD2L1, CENPF, CDC20, MCM4, MKI67, PBK, CKS2, KIF2C, MRPL13, TTK, BUB1, TK1, FOXM1, CEP55, EZH2, ECT2, PRC1, CENPU, CCNE2, AURKA, HMGB3, APOBEC3B, LAGE3, CDKN3, DTL, ATP6V1C1, KIAA0101, CD2, KIF11, KIF20A, CDCA8, NCAPG, CENPN, MTFR1, MCM2, DSCC1, WDR19, SEMA3G, KCND3, SETBP1, KIF13B, NR4A2, NAV3, PDZRN3, MAGI2, CACNA1D, STC2, CHAD, PDGFD, ARMCX2, FRY, AGTR1, MARCH8, ANG, ABAT, THBD, RAI2, HSPA2, ERBB4, ECHDC2, FST, EPHX2, FOSB, STARD13, ID4, FAM129A, FCGBP, LAMA2, FGFR2, PTGER3, NME5, LRRC17, OSBPL1A, ADRA2A, LRP2, C1orf115, COL4A5, DIXDC1, KIAA1324, HPN, KLF4, SCUBE2, FMO5, SORBS2, CARD10, CITED2, MUC1, BCL2, RGS5, CYBRD1, OMD, IGFBP4, LAMB2, DUSP4, PDLIM5, IRS2, and CX3CR1.

As another example, some embodiments involve using a gene set for predicting kidney clear cell cancer grade. Genes in the gene set may be selected from the group consisting of: PLTP, C1S, LY96, TSKU, TPST2, SERPINF1, SRPX2, SAA1, CTHRC1, GFPT2, CKAP4, SERPINA3, CFH, PLAU, BASP1, PTTG1, MOCOS, LEF1, SLPI, PRAME, STEAP3, LGALS2, CD44, FLNC, UBE2C, CTSK, SULF2, TMEM45A, FCGR1A, PLOD2, C19orf80, PDGFRL, IGF2BP3, SLC7A5, PRRX1, RARRES1, LHFPL2, KDELR3, TRIB3, IL20RB, FBLN1, KMO, C1R, CYP1B1, KIF2A, PLAUR, CKS2, CDCP1, SFRP4, HAMP, MMP9, SLC3A1, NAT8, FRMD3, NPR3, NAT8B, BBOX1, SLC5A1, GBA3, EMCN, SLC47A1, AQP1, PCK1, UGT2A3, BHMT, FMO1, ACAA2, SLC5A8, SLC16A9, TSPAN18, SLC17A3, STK32B, MAP7, MYLIP, SLC22A12, LRP2, CD34, PODXL, ZBTB42, TEK, FBP1, and BCL2.

Aspects of using statistical models for predicting tissue of origin, cancer grade, and/or any other characteristics of a biological sample are described in the U.S. Provisional Patent Application Ser. No. 62/943,976, titled “Machine Learning Techniques for Gene Expression Analysis”, filed on Dec. 5, 2019, which is incorporated by reference herein in its entirety.

Returning to aspects of evaluating quality of sequence information, in some embodiments, the quality of sequence information from one or more (e.g., at least two) nucleic acid samples is evaluated by determining the presence or absence of polyadenylated RNA genes to predict whether the sequence information was obtained from polyA RNA or not. In some embodiments, if there is disagreement between the predicted and expected (e.g., asserted) polyA status of one or more samples, the sequence information for those samples is deemed as suspect, of insufficient quality, and is removed, discarded, retested, and/or reported as such to a user. In some embodiments, if there is agreement between the predicted and expected polyA status for one or more nucleic acid samples, the sequence information is deemed of sufficient quality, and is retained, and/or reported as such to a user.

In some embodiments, the quality of sequence information from one or more (e.g., at least two) nucleic acid samples is evaluated by determining a complexity value of the sequence information. In some embodiments, determining a complexity value comprises determining the number of duplications. In some embodiments, the % duplications can be determined for a DNA or RNA library. In some embodiments, if a large percentage of the library is duplicated, either a library of low complexity or over-amplification of the DNA or cDNA fragments is indicated. In some instances, differences between libraries in the complexity or amplification indicates that certain biases in the data are introduced (e.g., differing % GC content). In some embodiments, if the complexity value is less than 75%, or less than 80%, the sequence information is deemed suspect, of insufficient quality, and is removed, discarded, retested, and/or reported as such to a user. In some embodiments, if the complexity value is at least 80%, or at least 85%, the sequence information is deemed of sufficient quality for further analysis, and is retained, and/or reported as such to a user.
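One way a complexity value might be derived from duplications is sketched below. The sketch assumes, for illustration only, that a duplicate is a read whose sequence exactly repeats an earlier read and that the complexity value is 100% minus the % duplications; production pipelines typically identify duplicates from alignment positions instead. The function names are hypothetical.

```python
from collections import Counter


def duplication_percent(reads):
    """Percentage of reads that exactly repeat an earlier read sequence."""
    if not reads:
        raise ValueError("at least one read is required")
    counts = Counter(reads)
    duplicates = sum(count - 1 for count in counts.values())
    return 100.0 * duplicates / len(reads)


def complexity_is_sufficient(reads, min_complexity_pct=80.0):
    """Assumed definition: complexity = 100% - % duplications."""
    return (100.0 - duplication_percent(reads)) >= min_complexity_pct


if __name__ == "__main__":
    library = ["ACGT", "ACGT", "TTGA", "GGCC", "CATG"]
    print(duplication_percent(library))
    print(complexity_is_sufficient(library))
```

The 80% default mirrors one of the thresholds recited above; 85% could be substituted through the `min_complexity_pct` parameter.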

In some embodiments, the quality of sequence information from one or more (e.g., at least two) nucleic acid samples is evaluated by predicting a tissue source for the nucleic acid. In some embodiments, if there is disagreement between a predicted and an asserted tissue source for the nucleic acid, the sequence information is deemed suspect, of insufficient quality, and is removed, discarded, retested, or reported as such. In some embodiments, if there is agreement between the predicted and asserted tissue sources, the sequence information is deemed of sufficient quality for further analysis, and is retained, and/or reported as such to a user.

In some embodiments, the quality of sequence information from one or more (e.g., at least two) nucleic acid samples is evaluated by (a) determining a gene expression level for two different subunits of a known protein; and (b) determining an expression ratio for the two different subunits. In some embodiments, if the determined expression ratio does not match the expected expression ratio for the protein subunits, the sequence information is identified as suspect, of insufficient quality, and is removed, discarded, retested, and/or reported as such to a user. In some embodiments, if the determined expression ratio matches an expected expression ratio for the protein subunits, the sequence information is deemed of sufficient quality for further analysis, and is retained, and/or reported as such to a user.
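The subunit ratio check in (a) and (b) above may be sketched as follows. The text does not specify how closely the determined ratio must match the expected ratio, so the tolerance used here (a maximum deviation of one log2 unit, i.e., two-fold) and the default expected ratio of 1.0 are assumptions chosen purely for illustration, as is the function name.

```python
import math


def subunit_ratio_matches(level_a, level_b, expected_ratio=1.0, max_log2_dev=1.0):
    """Compare the expression ratio of two protein subunits to an expected ratio.

    Returns False when either level is non-positive, because no meaningful
    ratio can be formed. The tolerance is a hypothetical choice.
    """
    if level_a <= 0 or level_b <= 0:
        return False
    observed_ratio = level_a / level_b
    return abs(math.log2(observed_ratio / expected_ratio)) <= max_log2_dev


if __name__ == "__main__":
    print(subunit_ratio_matches(100.0, 90.0))  # close to the expected ratio
    print(subunit_ratio_matches(100.0, 10.0))  # ten-fold imbalance
```

Comparing on a log scale treats a two-fold excess and a two-fold deficit symmetrically, which is why it was chosen for this sketch.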

In some embodiments, the quality of sequence information from one or more (e.g., at least two) nucleic acid samples is evaluated by determining a Phred Score for the sequence information. In some embodiments, if the Phred Score is less than 27, the sequence information is deemed suspect, of insufficient quality, and is removed, discarded, retested, and/or reported as such to a user. In some embodiments, if the Phred Score is less than 20, the sequence information is removed, discarded, retested, and/or reported as such to a user. In some embodiments, if the Phred Score is more than 20 and is less than 27, the sequence information is deemed close to being removed, discarded, retested, and/or reported as such to a user. In some embodiments, if the Phred Score is at least 27, the sequence information is deemed of sufficient quality for further analysis, and is retained, and/or reported as such to a user.
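For context, a Phred score Q corresponds to a base-calling error probability P through Q = -10 log10(P), so that Q = 20 corresponds to an error probability of 1 in 100. The following minimal sketch computes a mean Phred score from a FASTQ-style quality string and applies the thresholds recited above. It assumes Phred+33 character encoding, treats a score of exactly 20 as falling in the intermediate band, and uses hypothetical function names and labels.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality_string:
        raise ValueError("quality string must not be empty")
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def error_probability(phred_score):
    """Invert Q = -10 * log10(P)."""
    return 10 ** (-phred_score / 10.0)


def classify_phred(phred_score):
    """Thresholds from the text: below 20, between 20 and 27, at least 27."""
    if phred_score < 20:
        return "insufficient"
    if phred_score < 27:
        return "warning"
    return "sufficient"


if __name__ == "__main__":
    score = mean_phred("IIII")  # 'I' encodes Q40 under Phred+33
    print(score, classify_phred(score))
```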

In some embodiments, the quality of sequence information from one or more (e.g., at least two) nucleic acid samples is evaluated by determining a GC content for the sequence information. In some embodiments, if the GC content is at least 30%, and less than or equal to 55%, the sequence information is deemed of sufficient quality for further analysis, and is retained, and/or reported as such to a user. In some embodiments, if the GC content is in the range of 45-65%, the sequence information is deemed of sufficient quality for further analysis, and is retained, and/or reported as such to a user (i.e., acceptable). In some embodiments, GC content of at least 50% (e.g., 50%, 51%, 60%) is the target value at least for human samples.
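A GC content check of this kind may be sketched as follows. The function names are hypothetical, and the choice to exclude ambiguous bases (e.g., N) from the denominator is an assumption of this sketch rather than a requirement of the methods described herein.

```python
def gc_percent(sequences):
    """Percentage of G and C among the unambiguous bases (A, C, G, T)."""
    gc = 0
    total = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        total += sum(seq.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("no unambiguous bases found")
    return 100.0 * gc / total


def gc_in_range(gc_pct, low=30.0, high=55.0):
    """Default range follows one embodiment above; 45-65% is another."""
    return low <= gc_pct <= high


if __name__ == "__main__":
    reads = ["GGCC", "AATT"]
    pct = gc_percent(reads)
    print(pct, gc_in_range(pct))
```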

In some embodiments, at least two (e.g., 3, 4, 5, 6, 7, 8, 9, 10, or more) distinct methods for evaluating the quality (e.g., source and/or integrity) of sequence information are performed. In some embodiments, the methods performed herein evaluate sequence information from a mammal. In some embodiments, the mammal is human.

In some embodiments, the subject from whom the sample that generated the sequence information was obtained has, is suspected of having, or is at risk of having a disorder. In some embodiments, the disorder is cancer.

In some embodiments, a report is generated comprising the one or more features or results of the methods described herein. In some embodiments, the report further comprises an analysis of the results of the methods described herein.

In some embodiments, the methods or processes of the disclosure may be carried out on a system or computer processor (e.g., laptop, desktop, server, or other computerized machine). The components of the system may reside in disparate places and communicate over networks, such as local area networks or wide area networks, or by internet protocols. The system may interface with the user via web-enabled browsers and graphical user interfaces (GUIs). In some embodiments, the system is under the control of the user in one place. In some embodiments, the system comprises components that are not in one place and which may not be under the direct control of the user. In some embodiments, the information of the system is stored locally.

As described herein, the terms “process,” “act,” and “step,” or variations thereof, that are used in computerized processes or flow charts herein can be used interchangeably, unless indicated otherwise.

As described herein, the terms “patient,” “subject,” and “human subject,” or variations thereof, can be used interchangeably, unless indicated otherwise.

FIG. 6A is a flow chart showing illustrative computerized process 200 for performing non-stranded RNA sequencing with coding RNA enrichment. Process 200 begins at act 201, where a first sample of a first tumor from a subject having, suspected of having, or at risk of having cancer is obtained. Further aspects relating to obtaining a first sample of a first tumor from a subject having, suspected of having, or at risk of having cancer are provided in the section called “Biological samples.”

Next, process 200 proceeds to act 202, where RNA from the first sample of the first tumor is extracted. Aspects relating to extracting RNA from the first sample of the first tumor are described in the section called “Extraction of DNA and/or RNA.”

Next, process 200 proceeds to act 203, where the extracted RNA is enriched for coding RNA to obtain enriched RNA. Aspects relating to enriching the extracted RNA for coding RNA to obtain enriched RNA are described in the section called “RNA enrichment.”

Next, process 200 proceeds to act 204, where a first library of cDNA fragments from the enriched RNA for non-stranded RNA sequencing is prepared. Aspects relating to preparing a first library of cDNA fragments from the enriched RNA for non-stranded RNA sequencing are described in the section called “Library preparation for RNA sequencing.”

Next, process 200 proceeds to act 205, where non-stranded RNA sequencing is performed on the first library of cDNA fragments prepared from the enriched RNA. Aspects relating to performing non-stranded RNA sequencing on the first library of cDNA fragments prepared from the enriched RNA are described in the section called “RNA sequencing.” It should be appreciated that one or more acts of process 200 may be optional.

FIG. 6B is a flow chart showing computerized process 210 for identifying a cancer treatment by obtaining bias-corrected gene expression data. Process 210 begins at act 211, where RNA expression data for a subject having, suspected of having, or at risk of having cancer is obtained. Aspects relating to obtaining RNA expression data are described in the section called “Obtaining RNA expression data.”

Next, process 210 proceeds to act 212, where genes in the RNA expression data are aligned to a reference and the RNA expression data is annotated. Aspects relating to aligning and annotating genes in the RNA expression data with known sequences of the human genome to obtain annotated RNA expression data are described in the section called “Alignment and annotation.”

Next, process 210 proceeds to act 213, where non-coding transcripts from the annotated RNA expression data are removed to obtain filtered RNA expression data. Aspects relating to removing non-coding transcripts from the annotated RNA expression data are described in the section titled “Removing non-coding transcripts.”

Next, process 210 proceeds to act 214, where the filtered RNA expression data is normalized to obtain gene expression data. The gene expression data may be in transcripts per kilobase million (TPM) format. Aspects of normalizing the filtered RNA expression data to gene expression data in transcripts per kilobase million (TPM) are described in the section called “Conversion to TPM and gene aggregation.”
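By way of a non-limiting sketch, normalization of the kind performed at act 214 may follow the conventional TPM calculation: each transcript's read count is divided by its length in kilobases, and the resulting rates are scaled so that they sum to one million. The function name, transcript identifiers, and values below are hypothetical, and the sketch is not intended to reproduce the specific procedure of the section referenced above.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert read counts to TPM using the conventional formula.

    counts: mapping of transcript id to read count.
    lengths_bp: mapping of transcript id to transcript length in base pairs.
    """
    rates = {t: counts[t] / (lengths_bp[t] / 1000.0) for t in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("all counts are zero")
    return {t: rate / total * 1e6 for t, rate in rates.items()}


if __name__ == "__main__":
    counts = {"tx1": 100, "tx2": 100}
    lengths = {"tx1": 1000, "tx2": 2000}
    print(counts_to_tpm(counts, lengths))
```

In this example both transcripts have the same count, but the shorter transcript receives twice the TPM value because its reads are concentrated over half the length.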

Next, process 210 proceeds to act 215, where at least one gene that introduces bias in the gene expression data is identified. Aspects of identifying at least one gene that introduces bias in the gene expression data are described in the section called “Removing bias.”

Next, process 210 proceeds to act 216, where expression data associated with the at least one gene that introduces bias is removed from the gene expression data to obtain bias-corrected gene expression data. Aspects of removing the expression data, associated with the at least one gene that introduces bias into the gene expression data, from the gene expression data to obtain bias-corrected gene expression data are described in the section called “Removing bias.”

Next, process 210 proceeds to act 217, where a cancer treatment for the subject using the bias-corrected gene expression data is identified. Aspects relating to identifying a cancer treatment for the subject using the bias-corrected gene expression data are described in the section called “Identifying a cancer treatment.”

FIG. 6C is a flow chart showing computerized process 220 for identifying a cancer treatment for the subject having, suspected of having, or at risk of having cancer using the bias-corrected gene expression data. Process 220 begins at act 221, where RNA for coding RNA in a sample of extracted RNA from a first tumor sample from a subject having, suspected of having, or at risk of having cancer is enriched. Aspects relating to enriching RNA for coding RNA in a sample of extracted RNA are described in the section called “Extraction of DNA and/or RNA.”

Next, process 220 proceeds to act 222, where non-stranded RNA sequencing on a first library of cDNA fragments prepared from the enriched RNA to obtain RNA expression data is performed. Aspects relating to performing non-stranded RNA sequencing on a first library of cDNA fragments prepared from the enriched RNA to obtain RNA expression data are described in the section called “RNA sequencing.”

Next, process 220 proceeds to act 223, where the RNA expression data is converted to gene expression data. Next, process 220 proceeds to act 224, where at least one gene that introduces bias in the gene expression data is identified. Next, process 220 proceeds to act 225, where expression data associated with the at least one gene that introduces bias is removed from the gene expression data to obtain bias-corrected gene expression data. Aspects relating to acts 223, 224, and 225 are described in the section called “Removing bias.”

Next, process 220 proceeds to act 226, where a cancer treatment for the subject using the bias-corrected gene expression data is identified. Aspects relating to identifying a cancer treatment for the subject using the bias-corrected gene expression data are described in the section called “Identifying a cancer treatment.”

FIG. 7 is an exemplary flow chart showing a computerized process 300 for preparing patient samples for sequencing analysis and performing bioinformatics quality control, so that a cancer treatment suitable for the patient or subject, from which the nucleic acids are extracted for sequencing analysis, can be obtained.

In the illustrated embodiment, the process 300 comprises obtaining a first sample of a first tumor from a subject having, suspected of having, or at risk of having cancer at act 301, extracting RNA from the first sample of the first tumor at act 302, enriching the RNA for coding RNA to obtain enriched RNA at act 303, preparing a first library of cDNA fragments from the enriched RNA for non-stranded RNA sequencing at act 304, obtaining RNA expression data for the subject at act 305, aligning and annotating genes in the RNA expression data with known sequences of the human genome to obtain annotated RNA expression data at act 306, removing non-coding transcripts from the annotated RNA expression data at act 307, converting the annotated RNA expression data to gene expression data (e.g., in transcripts per kilobase million (TPM) format) at act 308, identifying at least one gene that introduces bias in the gene expression data at act 309, removing expression data for the at least one gene that introduces bias from the gene expression data to obtain bias-corrected gene expression data at act 310, obtaining sequence information and asserted information at act 311, determining one or more features from the sequence information at act 312, determining whether one or more features match asserted information at act 313, making at least one additional determination of the features at act 314, and identifying a cancer treatment for the subject using the bias-corrected gene expression data at act 315.

It should be appreciated that one or more acts of process 300 may be optional. For example, in some embodiments, acts 301 and 302 may be performed and act 303 is optional. In some embodiments, acts 301, 302, and 303 are all performed. In some embodiments, acts 301, 302, and 303 are all omitted, whereas the remaining acts are performed. This may be useful when the extracted, enriched RNA from the patient sample is already available prior to the start of process 300. In some embodiments, the one or more features at act 312 comprise one or more of the following features: source, patient, tissue type, tumor type, polyA status, MHC sequence, protein subunit ratio, complexity, contamination, coverage, exon coverage, read composition, Phred score, SNP concordance, and GC content. In some embodiments, the one or more features at act 312 further comprise strandedness of RNA sequence analysis. In some embodiments, any one or more of the features at act 312 can be determined. In some embodiments, the additional determination of the features at act 314 can include, but is not limited to, concordance value of SNPs, contamination value, polyA status, complexity value, Phred score, and GC content. In some embodiments, the additional determination of any one or more of the features may be performed at act 314. In some embodiments, any one or more of acts 303, 307, and 314 may be omitted. In some embodiments, all acts of the computerized process 300 may be performed.

FIG. 8 illustrates a non-limiting process pipeline 800 for processing and validating sequence data and asserted information associated with the sequence data for subsequent analysis (e.g., for diagnostic, prognostic, therapeutic, and/or other clinical applications). Act 801 is performed by obtaining nucleic acid data comprising sequence data and asserted information indicating an asserted source for the sequence data. In some embodiments, the nucleic acid data is obtained from a biological sample that was previously processed. In some embodiments, the biological sample was previously obtained from a subject having, suspected of having, or at risk of having cancer. In some embodiments, act 801 is performed by obtaining nucleic acid data comprising an asserted integrity of the sequence data. In some embodiments, act 801 can be performed by obtaining nucleic acid data comprising sequence data and asserted information indicating an asserted source and an asserted integrity of the sequence data. In some embodiments, the asserted information indicates the asserted integrity of the sequence data. In some embodiments, the asserted information is indicative of a subject from whom the nucleic acid was obtained. For example, in some embodiments, asserted information comprises MHC allele information and/or SNP information for one or more loci of the subject. After act 801, process 800 proceeds to acts 802 and 803 where the nucleic acid data obtained at act 801 is validated. The validation comprises processing the sequence data at act 802 to obtain a determined integrity and/or a determined source and determining whether the determined integrity and/or determined source matches the asserted integrity and/or asserted source, respectively, in act 803.

The sequence data is processed in act 802 to obtain determined information indicating a determined source of the sequence data in act 802 a and/or determined information indicating a determined integrity of the sequence data in act 802 b. In some embodiments, act 802 a may comprise determining information indicative of at least one, two, or three of the MHC genotype of the subject, whether the nucleic acid data is RNA data or DNA data, a tissue type of the biological sample, a tumor type of the biological sample, a sequencing platform used to generate the sequence data, SNP concordance (e.g., determining whether one or more SNPs in the sequence data match one or more SNPs in a reference sequence), and/or whether an RNA sample is polyA enriched. In some embodiments, act 802 b may comprise determining a first level of a first nucleic acid encoding a first subunit of a multimeric protein, determining a second level of a second nucleic acid encoding a second subunit of a multimeric protein, and determining whether a ratio between the first level and the second level matches an expected ratio. In some embodiments, the first subunit and the second subunit are first and second CD3 subunits, first and second CD8 subunits, or first and second CD79 subunits. In some embodiments, the determined information indicative of the determined integrity is indicative of at least one, two, or three of total sequence coverage, exon coverage, chromosomal coverage, a ratio of nucleic acids encoding two or more subunits of a multimeric protein, species contamination, complexity, and/or guanine (G) and cytosine (C) percentage (%) of the sequence data. In some embodiments, act 803 comprises determining one or more MHC allele sequences from the sequence data and determining whether the one or more MHC allele sequences match the asserted MHC allele information for the subject. In some embodiments, determining MHC allele sequences comprises determining sequences for six MHC loci from the sequence data.

In act 803, the determined integrity and/or source is evaluated by determining whether the determined source of the sequence data matches the asserted source of the sequence data and/or whether the determined integrity of the sequence data matches the asserted integrity of the sequence data.

If the asserted and determined information match in act 803 (i.e., yes), process 800 proceeds to act 804 where the sequence data is further evaluated to determine whether the sequence data is indicative of a diagnostic, prognostic, therapeutic, or other clinical outcome. For example, in some embodiments the sequence data is further processed in act 804 to provide a recommendation for a cancer treatment for a subject having, suspected of having, or at risk of having cancer. In some embodiments, act 804 is performed by determining a therapy for the subject and the therapy is subsequently administered to the subject.

In some embodiments, a process may further comprise administering the therapy to the subject. In some embodiments, the therapy is a cancer therapy.

In some embodiments, determining the therapy for the subject may include determining a plurality of gene group expression levels comprising a gene group expression level for each gene group in a set of gene groups. In some embodiments, the set of gene groups comprises at least one gene group associated with cancer malignancy, and at least one gene group associated with cancer microenvironment. The therapy for the subject is identified by using the determined gene group expression levels.

If the asserted and determined information do not match in act 803 (i.e., no), process 800 proceeds to act 805, where one or more remedial action(s) are performed. In some embodiments, a remedial action comprises generating an indication that the determined information does not match the asserted information, generating an indication to not process the sequence data in a subsequent analysis, and/or generating an indication to obtain additional sequence data and/or other information about the biological sample and/or the subject.
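The decision made at act 803 and the branches to acts 804 and 805 may be sketched as follows. The sketch assumes, for illustration only, that asserted and determined information are each represented as a mapping from a feature name to a value and that a match requires exact equality; the function name, dictionary keys, and wording of the indications are hypothetical.

```python
def validate_sequence_data(asserted, determined):
    """Compare asserted and determined information feature by feature.

    Returns the next act (804 on a match, 805 otherwise) together with
    indications corresponding to the remedial actions of act 805.
    """
    mismatched = sorted(k for k, v in asserted.items() if determined.get(k) != v)
    if not mismatched:
        return {"match": True, "next_act": 804, "indications": []}
    indications = [f"determined {k} does not match asserted {k}" for k in mismatched]
    indications.append("do not process the sequence data in subsequent analysis")
    indications.append("obtain additional sequence data or sample information")
    return {"match": False, "next_act": 805, "indications": indications}


if __name__ == "__main__":
    asserted = {"tissue_type": "breast", "polyA": True}
    determined = {"tissue_type": "lung", "polyA": True}
    print(validate_sequence_data(asserted, determined))
```

In practice, quantitative features such as an expression ratio or GC content would be compared against a range rather than by exact equality.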

In some embodiments, a method comprises all acts illustrated in FIG. 8. However, in some embodiments, a subset of the acts is performed and any one or more of the acts may be omitted, duplicated, and/or performed in a different order than illustrated in FIG. 8. For example, either act 802 a or act 802 b is performed in act 802. For example, act 803 can be performed twice to confirm the decision. For example, one or more acts in process 800 can be performed after the one or more remedial action(s) in act 805. In some embodiments, one or more acts of FIG. 8 are implemented on a computer.

In some embodiments, expression levels of one or more genes in a sample are analyzed to evaluate the origin and/or quality of the sample. For example, the expression of one or more genes that are known to be expressed in a particular cell, tissue, or tumor type is evaluated to determine whether it is at an expected expression level based on the expected cell, tissue, or tumor that is being analyzed. Similarly, the expression of one or more genes that are known not to be expressed (or not highly expressed) in a particular cell, tissue, or tumor type is evaluated to determine whether it is at an expected expression level based on the expected cell, tissue, or tumor that is being analyzed.

In some embodiments, expression levels of one or more genes are analyzed for each of a plurality of samples (e.g., 2, 3, 4, 5, 4-10, 1-50, 50-500, or more samples). If the expression of one or more genes is lower or higher than expected, this may be indicative that the quality and/or source/origin of the data being analyzed is not what was expected. In some embodiments, data from a sample that has an unexpected level (e.g., a lower or higher than expected level) of expression for one or more genes is excluded from further analysis. In some embodiments, new sequence information is obtained for a sample that has an unexpected level of expression for one or more genes, for example to confirm whether the initial data was correct. In some embodiments, a sample that has an unexpected level of expression for one or more genes can be further analyzed, for example to determine whether the sample was from a different source than initially indicated.

In some embodiments, expression levels for one or more genes were analyzed (e.g., using tSNE, PCA, or other technique) to determine whether gene expression or patterns of gene expression were similar or different in separate samples. In some embodiments, if datasets comprising the same cell type or same tissue type did not cluster within a group, or if one or more datasets were identified as statistically different from other datasets comprising the same cells or tissue, then the dataset(s) identified as different could be excluded, further analyzed, or flagged as potentially suspect. In some embodiments, additional sequence data can be obtained for a sample identified as potentially suspect.

An illustrative implementation of a computer system 500 that may be usedin connection with any of the embodiments of the technology describedherein is shown in FIG. 9 The computer system 500 includes one or moreprocessors 510 and one or more articles of manufacture that comprisenon-transitory computer-readable storage media (e.g., memory 520 and oneor more non-volatile storage media 530). The processor 510 may controlwriting data to and reading data from the memory 520 and thenon-volatile storage device 530 in any suitable manner, as the aspectsof the technology described herein are not limited in this respect. Toperform any of the functionality described herein, the processor 510 mayexecute one or more processor-executable instructions stored in one ormore non-transitory computer-readable storage media (e.g., the memory520), which may serve as non-transitory computer-readable storage mediastoring processor-executable instructions for execution by the processor510.

Computing device 500 may also include a network input/output (I/O) interface 540 via which the computing device may communicate with other computing devices (e.g., over a network), and may also include one or more user I/O interfaces 550, via which the computing device may provide output to and receive input from a user. The user I/O interfaces may include devices such as a keyboard, a mouse, a microphone, a display device (e.g., a monitor or touch screen), speakers, a camera, and/or various other types of I/O devices.

The embodiments described herein can be implemented in any of numerous ways. For example, the embodiments may be implemented using hardware, software, or a combination thereof. When implemented in software, the software code can be executed on any suitable processor (e.g., a microprocessor) or collection of processors, whether provided in a single computing device or distributed among multiple computing devices. It should be appreciated that any component or collection of components that perform the functions described above can be generically considered as one or more controllers that control the above-described functions. The one or more controllers can be implemented in numerous ways, such as with dedicated hardware, or with general purpose hardware (e.g., one or more processors) that is programmed using microcode or software to perform the functions recited above.

In this respect, it should be appreciated that one implementation of the embodiments described herein comprises at least one computer-readable storage medium (e.g., RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other tangible, non-transitory computer-readable storage medium) encoded with a computer program (i.e., a plurality of executable instructions) that, when executed on one or more processors, performs the above-described functions of one or more embodiments. The computer-readable medium may be transportable such that the program stored thereon can be loaded onto any computing device to implement aspects of the techniques described herein. In addition, it should be appreciated that the reference to a computer program which, when executed, performs any of the above-described functions, is not limited to an application program running on a host computer. Rather, the terms computer program and software are used herein in a generic sense to reference any type of computer code (e.g., application software, firmware, microcode, or any other form of computer instruction) that can be employed to program one or more processors to implement aspects of the techniques described herein.

Aspects of the technology described herein provide computer-implemented methods for evaluating, generating, visualizing, and/or classifying biological characteristic(s) of sequence information (e.g., cancer grade, tissue of origin) of subjects (e.g., cancer patients) or those having, suspected of having, or at risk of having a disorder (e.g., cancer).

In some embodiments, a software program may provide a user with a visual representation of the characteristic(s) of a subject (e.g., a patient) and/or other information related to the subject's cancer using an interactive graphical user interface (GUI). Such a software program may execute in any suitable computing environment including, but not limited to, a cloud-computing environment, a device co-located with a user (e.g., the user's laptop, desktop, smartphone, etc.), one or more devices remote from the user (e.g., one or more servers), etc.

For example, in some embodiments, the techniques described herein may be implemented in the illustrative environment 600 shown in FIG. 10. As shown in FIG. 10, within illustrative environment 600, one or more biological samples of a subject 680 may be provided to a laboratory 670. Laboratory 670 may process the biological sample(s) to obtain expression data (e.g., DNA, RNA, and/or protein expression data) and/or sequence information and provide it, via network 610, to at least one database 620 that stores information about subject (e.g., patient) 680.

Network 610 may be a wide area network (e.g., the Internet), a local area network (e.g., a corporate Intranet), and/or any other suitable type of network. Any of the devices shown in FIG. 10 may connect to the network 610 using one or more wired links, one or more wireless links, and/or any suitable combination thereof.

In the illustrated embodiment of FIG. 10, the at least one database 620 may store expression data and/or sequence information for the subject (e.g., patient), medical history data for the subject (e.g., patient), test result data for the subject (e.g., patient), and/or any other suitable information about the subject 680. Examples of stored test result data for the subject (e.g., patient) include biopsy test results, imaging test results (e.g., MRI results), and blood test results. The information stored in at least one database 620 may be stored in any suitable format and/or using any suitable data structure(s), as aspects of the technology described herein are not limited in this respect. The at least one database 620 may store data in any suitable way (e.g., one or more databases, one or more files). The at least one database 620 may be a single database or multiple databases.

As shown in FIG. 10, illustrative environment 600 includes one or more external databases 660, which may store information for patients other than patient 680. For example, external databases 660 may store expression data and/or sequence information (of any suitable type) for one or more patients, medical history data for one or more patients, test result data (e.g., imaging results, biopsy results, blood test results) for one or more patients, demographic and/or biographic information for one or more patients, and/or any other suitable type of information. In some embodiments, external database(s) 660 may store information available in one or more publicly accessible databases such as TCGA (The Cancer Genome Atlas), one or more databases of clinical trial information, and/or one or more databases maintained by commercial sequencing suppliers. The external database(s) 660 may store such information in any suitable way using any suitable hardware, as aspects of the technology described herein are not limited in this respect.

In some embodiments, the at least one database 620 and the external database(s) 660 may be the same database, may be part of the same database system, or may be physically co-located, as aspects of the technology described herein are not limited in this respect.

In some embodiments, server(s) 640 may access information stored in database(s) 620 and/or 660 and use this information to perform the processes described herein, with reference to FIG. 10, for determining one or more characteristics of a biological sample and/or of the sequence information.

In some embodiments, server(s) 640 may include one or multiple computing devices. When server(s) 640 include multiple computing devices, the device(s) may be physically co-located (e.g., in a single room) or distributed across multiple physical locations. In some embodiments, server(s) 640 may be part of a cloud computing infrastructure. In some embodiments, one or more server(s) 640 may be co-located in a facility operated by an entity (e.g., a hospital, research institution) with which doctor 650 is affiliated. In such embodiments, it may be easier to allow server(s) 640 to access private medical data for the patient 680.

As shown in FIG. 10, in some embodiments, the results of the analysis performed by server(s) 640 may be provided to doctor 650 through a computing device 630 (which may be a portable computing device, such as a laptop or smartphone, or a fixed computing device such as a desktop computer). The results may be provided in a written report, an e-mail, a graphical user interface, and/or any other suitable way. It should be appreciated that although in the embodiment of FIG. 10, the results are provided to a doctor 650, in other embodiments, the results of the analysis may be provided to patient 680 or a caretaker of patient 680, a healthcare provider such as a nurse, or a person involved with a clinical trial.

In some embodiments, the results may be part of a graphical user interface (GUI) presented to the doctor 650 via the computing device 630. In some embodiments, the GUI may be presented to the user as part of a webpage displayed by a web browser executing on the computing device 630. In some embodiments, the GUI may be presented to the user using an application program (different from a web browser) executing on the computing device 630. For example, in some embodiments, the computing device 630 may be a mobile device (e.g., a smartphone) and the GUI may be presented to the user via an application program (e.g., “an app”) executing on the mobile device.

The GUI presented on computing device 630 may provide a wide range of oncological data relating to both the patient and the patient's cancer in a new way that is compact and highly informative. Previously, oncological data was obtained from multiple sources of data and at multiple times, making the process of obtaining such information costly from both a time and financial perspective. Using the techniques and graphical user interfaces illustrated herein, a user can access the same amount of information at once with less demand on the user and with less demand on the computing resources needed to provide such information. Low demand on the user serves to reduce clinician errors associated with searching various sources of information. Low demand on the computing resources serves to reduce processor power, network bandwidth, and memory needed to provide a wide range of oncological data, which is an improvement in computing technology. In some embodiments, the reports of the disclosure are presented to a user by means of a system or by means of a GUI.

Accordingly, in an aspect, the disclosure relates to a method of evaluating sequence information to determine at least one feature. The evaluation can take place on a computer or other automated machine capable of carrying out programmable instructions or can be performed manually by an evaluator. The features can be used to generate a report for informing the evaluator of the at least one feature of the sequence information. In some embodiments, the feature is the sequence of the MHC alleles of the sequence information.

The major histocompatibility complex (MHC) (referred to as the Human Leukocyte Antigens (HLAs) in humans) is the mechanism by which the immune system is able to differentiate between self and non-self cells. It is a collection of glycoproteins (proteins with a carbohydrate) that exist on the plasma membranes of nearly all body cells. The MHC molecules are encoded by highly polymorphic genes that are important in the immune system of biological organisms; they originate from 20 genes, with more than 50 variations per gene between individuals, and allow for co-dominance between alleles. These glycoproteins are part of a pathway which enables the immune system to identify self and non-self cells by aberrations in the MHC displayed on the plasma membrane.

Due to these properties (e.g., high polymorphism, co-dominance, and the large number of alleles that may be present in a given species), the MHC profile of a subject is highly specific and unique. Thus, it is extremely unlikely that two people, except for identical twins, will possess cells with the same set of MHC molecules. Accordingly, the MHC profile determined from sequence information can be used to corroborate, or disqualify, identifying information between the sequence information, asserted information, other sequence information, or a combination thereof.

In some embodiments, one MHC allele is used for the evaluation. In some embodiments, at least two MHC alleles are used for the evaluation. In some embodiments, at least three MHC alleles are used for the evaluation. In some embodiments, at least four MHC alleles are used for the evaluation. In some embodiments, at least five MHC alleles are used for the evaluation. In some embodiments, at least six MHC alleles are used for the evaluation.
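As a non-limiting sketch, a comparison of MHC alleles called from sequence information against asserted alleles may look like the following. The sketch assumes that allele calls have already been produced by an HLA typing step (not shown) and are supplied as plain strings. The function name, the `min_shared` parameter, and the example allele lists are illustrative assumptions.

```python
from collections import Counter


def mhc_alleles_match(observed, asserted, min_shared=6):
    """Compare two lists of MHC allele calls.

    Counter intersection keeps duplicates, so a homozygous call
    (the same allele listed twice) is counted twice when both lists agree.
    Returns (is_match, number_of_shared_alleles).
    """
    shared = Counter(observed) & Counter(asserted)
    n_shared = sum(shared.values())
    return n_shared >= min_shared, n_shared


# Illustrative allele calls; these values are examples, not data from the disclosure.
from_sequence = ["HLA-A*02:01", "HLA-A*24:02", "HLA-B*07:02",
                 "HLA-B*15:01", "HLA-C*03:04", "HLA-C*07:02"]
asserted_same = list(from_sequence)
asserted_other = ["HLA-A*01:01", "HLA-A*03:01", "HLA-B*07:02",
                  "HLA-B*15:01", "HLA-C*03:04", "HLA-C*07:02"]

same_subject = mhc_alleles_match(from_sequence, asserted_same)
different_subject = mhc_alleles_match(from_sequence, asserted_other)
```

Lowering `min_shared` corresponds to the embodiments in which fewer alleles (e.g., one, two, or three) are used for the evaluation.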

In some embodiments, the evaluated feature is a concordance value of single nucleotide polymorphisms (SNPs). “SNP” or “Single Nucleotide Polymorphism,” as used herein, refers to a difference in a nucleic acid sequence (e.g., genome, sequence data set) at a single nucleotide (e.g., adenine (A), thymine (T), cytosine (C), and/or guanine (G)) shared between subjects of a species or within an individual subject on paired chromosomes. SNPs can be, or represent: changed nucleotides (e.g., A changed to T, G changed to A, etc.), known as a substitution; removed nucleotides, wherein the nucleotide is absent from the sequence entirely, known as a deletion; or added nucleotides, wherein an additional nucleotide is added to the sequence, known as an insertion. SNPs can lead to changes in an encoded protein (e.g., nonsynonymous SNPs), or not (e.g., synonymous SNPs). Further, when the SNP is nonsynonymous, it can cause a change in the encoded amino acid (e.g., missense) or cause a premature stop codon (e.g., nonsense). Synonymous SNPs can also alter the message of the nucleic acid sequence by influencing or changing the splice sites, transcription factor binding, and/or messenger RNA (mRNA) binding. These mutations (e.g., changes to the protein encoding abilities of the sequence) can cause a litany of effects including differences in phenotypes as well as various disease types. Moreover, SNPs occur in great numbers within a subject's genome, with some estimates being that a typical genome differs from the reference human genome at between 4 and 5 million sites, of which more than 99.9% are SNPs.

Since SNPs are encoded in nucleic acids which are part of the genome, they are passed from parent to progeny (both between subjects, and within a subject when nucleic acids replicate). Accordingly, because of this stable inheritance, and because of the large number thereof, SNPs can be used as a genetic marker of the relatedness of subjects, and also as a measure of the identity of two nucleic acid sequences as originating from the same subject. In some embodiments, the SNP concordance value is determined between the sequence information and a reference sequence. In some embodiments, the SNP concordance value is determined between the sequence information and an asserted value. In some embodiments, the SNP concordance value must be equal to or greater than a threshold value to be acceptable (e.g., deemed of sufficient quality and integrity) for use in further analyses. In some embodiments, the threshold value is 80%. In some embodiments, a SNP concordance value is determined between a sequence data set and a subject, wherein if the SNP concordance value is at least 70% (e.g., at least 71%, at least 72%, at least 73%, at least 74%, at least 75%, at least 76%, at least 77%, at least 78%, at least 79%, at least 80%, at least 81%, at least 82%, at least 83%, at least 84%, at least 85%, at least 86%, at least 87%, at least 88%, at least 89%, at least 90%, at least 91%, at least 92%, at least 93%, at least 94%, at least 95%, at least 95.5%, at least 96%, at least 96.5%, at least 97%, at least 97.5%, at least 98%, at least 98.5%, at least 99%, at least 99.5%, at least 99.6%, at least 99.7%, at least 99.8%, at least 99.9%, at least 99.95%, at least 99.99%, at least 99.999% or more), it is deemed to be sufficiently likely to be from the subject and identified as being from the subject. As described herein, in some embodiments, the determination of the concordance value is as described in the present disclosure. SNP concordance can be determined by any means available or known in the art; for example, SNP concordance can be determined using publicly available tools such as Conpair (github.com/nygenome/Conpair) or GATK GenotypeConcordance (software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeConcordance.php), or may be calculated manually. In other examples, SNP concordance can be determined using tools described at publicly available websites (genome.sph.umich.edu/wiki/VerifyBamID or software.broadinstitute.org/cancer/cga/contest).
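For illustration of a manual calculation, a minimal sketch of a SNP concordance value is given below. It assumes that genotypes have already been called for each data set and are keyed by (chromosome, position); it does not reproduce the method of Conpair, GATK, or any other named tool. The function names and example genotypes are hypothetical, and the 0.80 threshold corresponds to the 80% embodiment described above.

```python
def normalize_genotype(genotype):
    """Make 'G/A' and 'A/G' compare equal by sorting the alleles."""
    return tuple(sorted(genotype.split("/")))


def snp_concordance(calls_a, calls_b):
    """Fraction of sites genotyped in both data sets whose genotypes agree.

    calls_a, calls_b: dicts mapping (chromosome, position) -> genotype string.
    """
    shared_sites = [site for site in calls_a if site in calls_b]
    if not shared_sites:
        raise ValueError("no SNP sites are genotyped in both data sets")
    matches = sum(
        1
        for site in shared_sites
        if normalize_genotype(calls_a[site]) == normalize_genotype(calls_b[site])
    )
    return matches / len(shared_sites)


def is_acceptable(concordance, threshold=0.80):
    """Concordance must be equal to or greater than the threshold."""
    return concordance >= threshold


# Hypothetical genotype calls for two data sets asserted to share a subject.
tumor_calls = {("chr1", 100): "A/G", ("chr1", 200): "C/C", ("chr2", 50): "T/T",
               ("chr3", 10): "G/A", ("chr4", 5): "A/A"}
normal_calls = {("chr1", 100): "G/A", ("chr1", 200): "C/C", ("chr2", 50): "T/C",
                ("chr3", 10): "A/G", ("chr4", 5): "A/A", ("chr5", 1): "C/C"}

concordance = snp_concordance(tumor_calls, normal_calls)
```

Only sites present in both data sets enter the denominator, so a site covered in just one data set neither raises nor lowers the value.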

In some embodiments, the evaluated feature is a quality score, for example a Phred score. A “Phred Score” (which may also be known or referred to herein as a “Phred Quality Score”), as used herein, refers to a measure of the quality of the identification of nucleotides sequenced by nucleic acid sequencing systems or platforms (e.g., NGS). Phred Scores are known in the art and are often generated by the sequencing platform based upon several parameters (e.g., peak shape, resolution, etc.), and a score (Q) is assigned to each nucleotide base call (for a detailed review of the calculation, refer to Ewing B, Hillier L, Wendl MC, Green P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 1998 March; 8(3):175-85; and Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 1998 March; 8(3):186-94). The Phred Score of each base reflects the likelihood that a nucleotide base call is incorrect (base-calling error probability (P)) and is determined by the equation Q=−10 log₁₀ P. Thus, the score (e.g., Q) indicates the base call accuracy; for example, a Phred Score of 10 indicates a 90% call accuracy for the base in question, while a Phred Score of 40 indicates a 99.99% call accuracy for the same base. In some embodiments, the Phred Score of the sequence information is determined and compared to a reference value. In some embodiments, the reference value is at least 27, at least 28, at least 29, at least 30, or more than 30. In some embodiments, the Phred Score is determined and compared to the Phred Score of other sequence information. In some embodiments, the Phred Score is determined and compared to an asserted score. In some embodiments, the Phred Score is used as a base level determination of quality. In some embodiments, it is used to compare sequence information with asserted information to assess identity, because if the Phred Scores differ, it is unlikely that they are the same sequence information or from the same sample or subject.
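The relationship Q=−10 log₁₀ P and a comparison against a reference value can be illustrated with the short sketch below. It assumes that per-base quality characters use the common Phred+33 ASCII encoding, which is an assumption about the input file and is not stated in the disclosure. The function names and the reference value of 30 are illustrative.

```python
import math


def phred_from_error(p):
    """Q = -10 * log10(P), where P is the base-calling error probability."""
    return -10.0 * math.log10(p)


def error_from_phred(q):
    """Inverse relationship: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def mean_phred(quality_string, offset=33):
    """Arithmetic mean of per-base Phred Scores decoded from ASCII characters.

    offset=33 assumes Phred+33 encoding (an assumption about the input).
    """
    scores = [ord(char) - offset for char in quality_string]
    return sum(scores) / len(scores)


def meets_reference(quality_string, reference=30):
    """Compare the mean Phred Score of a read to a reference value."""
    return mean_phred(quality_string) >= reference


accuracy_q10 = 1.0 - error_from_phred(10)   # about 0.90
accuracy_q40 = 1.0 - error_from_phred(40)   # about 0.9999
read_quality = mean_phred("IIII")           # 'I' decodes to Q40 under Phred+33
```

Averaging Q values directly is the simplest summary; averaging the error probabilities and converting back to Q is a stricter alternative that weights low-quality bases more heavily.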

In some embodiments, the evaluated feature is a tumor type.

In some embodiments, the evaluated feature is a tissue type.

In some embodiments, the evaluated feature is the polyadenylation status of the sequence information. “Polyadenylated” or “PolyA,” as used herein, refers to the series of multiple adenosine monophosphate nucleotides attached to the 3′-end of messenger RNA (mRNA); this occurs after transcription and cleavage of the 3′-end of the transcript to free a hydroxyl. The “polyA tail,” as it is often called, is a characteristic of fully processed mRNA and assists in various cellular processes. For example, the polyA tail is the binding site for a protein (e.g., polyA-binding protein) which promotes export from the nucleus of the cell so that translation may occur, and which also affects translation and stability of mRNA. If only transcripts of protein coding mRNA are present, it is highly likely (e.g., indicative) that the sample that generated the sequence information was generated using mRNA-Seq. In some embodiments, the polyA status indicates mRNA-Seq was not used (e.g., the whole transcriptome was used). In some embodiments, the polyA status is evaluated against an asserted information. In some embodiments, the polyA status is evaluated against a reference sequence. In some embodiments, the probability of the sequence information being generated using either mRNA-Seq or the whole transcriptome must be above a threshold. In some embodiments, the threshold is a reference value. In some embodiments, the threshold level is 90%. In some embodiments, the threshold value is an asserted information. In some embodiments, the sequence information is from a sample which contained primarily polyadenylated nucleic acids. In some embodiments, the sequence information is from a sample which contained both polyadenylated and non-polyadenylated nucleic acids.

In some embodiments, the evaluated feature is the GC content of the sequence information. “G/C Content” or “guanine (G)-cytosine (C) content,” as used herein, refers to the percentage of nucleotides in a nucleic acid sample which are either G or C. It can be calculated by summing all of the G and C reads of given sequence information and dividing by the total nucleotides sequenced. In some embodiments, the sequence information is evaluated and the GC content is calculated by summing the number of base calls resulting in a G or C (e.g., G+C) in the sequence information and dividing by the total number of base calls in the sequence information (e.g., number of nucleotides in a sequence data set), or (G+C)/(Number of nucleotides in a sequence data set).

The GC content may also be used as a quality measure of the sequence information. Many known genomes have been sequenced, along with the respective exomes, transcriptomes, and various other portions thereof (e.g., measure of specific RNA components). Moreover, many of these sequences have been sequenced a great number of times, and averages and ranges for various components thereof have been generated, for example, the GC content of the human genome. The GC content of the human genome is known to vary from approximately 35% to 60%, and has a mean value (e.g., average value) of about 41%. Accordingly, as a quality measure, if sequence information which is identified as a human genome were to be evaluated to have a GC content of 75%, the quality of the sequence information (or the sample from which it was derived) would be in question. As a result, the evaluated GC content can be compared to known ranges of GC contents expected for the sequence information, to a value provided by the sequence information donor, to a value from a database of such values, to additional sequence information, or to a reference range for sequence information of a given type to ascertain if they are consistent or if the GC content indicates a problem, which may be due to degradation, residual primers, contamination, or other complications with the sequence information.

Accordingly, in some embodiments, the GC content feature is used in a method to evaluate the integrity of the sequence information by determining a GC content for each sequence data set being evaluated; wherein, if the GC content is either: (i) less than 30%; or (ii) greater than 55%, the nucleic acid samples are deemed likely of insufficient quality, and are removed, discarded, retested, or reported as of insufficient quality, and wherein, if the GC content is at least 30% and less than or equal to 55%, the nucleic acid samples are deemed of sufficient quality and are retained. Additionally, in some embodiments, the GC content may be calculated and compared to asserted information. The GC content, in some embodiments, can be used to match the asserted information and thus corroborate or question the identity of the sequence information as being the same sequence information asserted, or as being from a given sample, subject, or specific sample or subject.
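The calculation (G+C)/(number of nucleotides) and the 30% to 55% acceptance window described above can be sketched as follows. The sketch assumes reads are available as plain strings and, as a further assumption, excludes ambiguous base calls such as N from the denominator. The function names are illustrative.

```python
def gc_content(reads):
    """Fraction of base calls that are G or C across all reads.

    Only unambiguous calls (A, C, G, T) are counted in the denominator;
    excluding N is an assumption of this sketch.
    """
    gc_calls = 0
    total_calls = 0
    for read in reads:
        bases = read.upper()
        gc_calls += bases.count("G") + bases.count("C")
        total_calls += sum(bases.count(base) for base in "ACGT")
    if total_calls == 0:
        raise ValueError("no unambiguous base calls found")
    return gc_calls / total_calls


def passes_gc_check(gc_fraction, low=0.30, high=0.55):
    """Retain a data set when 30% <= GC content <= 55%."""
    return low <= gc_fraction <= high


balanced = gc_content(["GGCC", "AATT"])   # 4 of 8 calls are G or C
gc_rich = gc_content(["GGGC", "GCCA"])    # 7 of 8 calls are G or C
```

A data set that fails the check may be removed, discarded, retested, or reported as of insufficient quality, as described above.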

In some embodiments, the evaluated feature is a ratio of protein subunit expression (e.g., an expression ratio of nucleic acids encoding different subunits of a protein). Protein expression can be measured by any means known in the art; for example, expression can be determined (e.g., quantified) by counting the number of reads that mapped to each locus in the transcriptome. By evaluating the expression of protein subunits (e.g., of proteins which have multiple subunits expressed by different coding regions), and then calculating a ratio thereof, it is possible to compare the ratio with a known value, reference value, threshold value, other sequence information, or an asserted information. In some embodiments, the evaluated protein subunits are from proteins that are present in a human sample, regardless of the presence or absence of a certain type of cancer. Without wishing to be bound by any theory, such a protein is encoded by a housekeeping gene (e.g., a positive or negative control of a human sample). In some embodiments, the known value, reference value, or threshold value is a fixed ratio. For example, subunit A and subunit B of a known protein have a ratio of 1:1 or 2:1. In some embodiments, the evaluated protein subunits are from proteins that are present in a human sample having or suspected of having a certain type of cancer.

In some embodiments, the ratio is compared to a known value. In some embodiments, it is compared to an asserted information. In some embodiments, it is compared to other sequence information. In some embodiments, the protein, and subunits thereof, are selected for analysis due to their properties. For example, they may degrade quickly and therefore serve as a proxy for the stability and/or quality of the sample from which the sequence information was generated. In some embodiments, they may be selected for their variability between subjects, or samples, thereby allowing for comparison to corroborate or disqualify the identity of the sequence information.
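A comparison of a subunit expression ratio with a fixed expected ratio (e.g., 2:1) might be sketched as shown below. The gene names SUBUNIT_A and SUBUNIT_B, the 25% relative tolerance, and the function names are hypothetical, and the expression values are assumed to be already normalized for transcript length and sequencing depth (e.g., TPM) rather than raw read counts.

```python
def subunit_ratio(expression, subunit_a, subunit_b):
    """Ratio of expression of two subunits of the same protein."""
    denominator = expression[subunit_b]
    if denominator == 0:
        raise ValueError(f"no expression detected for {subunit_b}")
    return expression[subunit_a] / denominator


def ratio_is_consistent(observed, expected, rel_tol=0.25):
    """True when the observed ratio is within rel_tol of the expected ratio.

    rel_tol is a hypothetical tolerance chosen for illustration.
    """
    return abs(observed - expected) <= rel_tol * expected


# Hypothetical normalized expression values for two subunits.
expression = {"SUBUNIT_A": 200.0, "SUBUNIT_B": 100.0}

observed = subunit_ratio(expression, "SUBUNIT_A", "SUBUNIT_B")
matches_two_to_one = ratio_is_consistent(observed, expected=2.0)
matches_one_to_one = ratio_is_consistent(observed, expected=1.0)
```

An observed ratio outside the tolerance may indicate degradation of the sample or a mismatch with the asserted information, consistent with the uses described above.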

In some embodiments, the evaluated feature is the coverage value. “Coverage,” as used herein, refers to the number of unique reads of a given nucleotide in a reconstructed sequence. As the nucleic acid is sequenced, it is not sequenced in one entire read (e.g., start to finish in one pass), but rather is the result of multiple reads of portions or segments of the nucleic acid (e.g., RNA, exome, genome), which have an average length (L), wherein the whole nucleic acid when reconstructed has an overall length of (G). As the number of reads (N) of a nucleic acid increases, the coverage will also increase. The coverage can be calculated as N×L/G. In some embodiments, the coverage value is compared to an asserted information. In some embodiments, the coverage value is compared to a threshold or reference value. In some embodiments, the threshold or reference value is a statistically significant value. In some embodiments, the coverage value is compared to other sequence information. In some embodiments, the target value of the coverage for tumor is more than 150× (e.g., 170×, 190×). In some embodiments, the target value of the coverage for normal tissue is more than 100× (e.g., 110×, 120×, 130×). Publicly available tools can be used for determining the coverage value (github.com/brentp/mosdepth and biodatageeks.org/sequila/).
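The formula N×L/G and the comparison against the tumor and normal tissue targets can be illustrated with the sketch below. This is an average (expected) coverage estimate only and does not replace per-position tools such as mosdepth. The function names, the dictionary of targets, and the example read counts are assumptions made for illustration.

```python
# Target values taken from the embodiments above: coverage must be more than
# 150x for tumor and more than 100x for normal tissue.
COVERAGE_TARGETS = {"tumor": 150.0, "normal": 100.0}


def average_coverage(num_reads, mean_read_length, target_length):
    """Coverage = N x L / G."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    return num_reads * mean_read_length / target_length


def meets_target(coverage, sample_kind):
    """True when coverage is strictly greater than the target for the sample kind."""
    return coverage > COVERAGE_TARGETS[sample_kind]


# Hypothetical run: 60 million reads of 100 bases over a 40-megabase target.
coverage = average_coverage(60_000_000, 100, 40_000_000)
tumor_ok = meets_target(coverage, "tumor")
normal_ok = meets_target(coverage, "normal")
```

In this hypothetical run the coverage is exactly 150×, which satisfies the normal tissue target but not the tumor target, because the tumor target requires more than 150×.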

The features and evaluations described herein can be evaluated and used individually as well as in conjunction with one another. In some embodiments, at least one feature is evaluated (e.g., at least one, at least two, at least three, at least four, or more). In some embodiments, at least two features are evaluated (e.g., at least three, at least four, or more). In some embodiments, at least four features are evaluated (e.g., at least four, or more). In some embodiments, at least five features are evaluated (e.g., at least five, or more). In some embodiments, at least six features are evaluated (e.g., at least six, or more). In some embodiments, at least seven features are evaluated (e.g., at least seven, or more). In some embodiments, at least eight features are evaluated (e.g., at least eight, or more). In some embodiments, at least nine features are evaluated (e.g., at least nine, or more). In some embodiments, at least ten features are evaluated (e.g., at least ten, or more). In some embodiments, at least eleven features are evaluated (e.g., at least eleven, or more). In some embodiments, at least twelve features are evaluated (e.g., at least twelve, or more). In some embodiments, at least thirteen features are evaluated (e.g., at least thirteen, or more). In some embodiments, at least fourteen features are evaluated (e.g., at least fourteen, or more). In some embodiments, at least fifteen features are evaluated (e.g., at least fifteen, or more).

The features described herein can be evaluated sequentially, simultaneously, in parallel, or in a combination thereof. As can be envisioned, the evaluations required to evaluate some features may be useful in evaluating additional or other features. Accordingly, where such information or evaluation results are useful for other determinations, it is possible to perform evaluations of multiple features at once (e.g., simultaneous), or to use the information for follow-on evaluations (e.g., sequential). Additionally, it can be envisioned to run multiple evaluations of different features at the same time (e.g., parallel). In some embodiments, features are evaluated sequentially. In some embodiments, features are evaluated simultaneously. In some embodiments, features are evaluated in parallel. In some embodiments, features are evaluated in a combination of methods (e.g., simultaneously as well as sequentially).

Identifying a Cancer Treatment

A subject's sequencing data obtained using any one of the methods described herein may be used for various clinical purposes including, but not limited to, monitoring the progress of cancer in a subject, assessing the efficacy of a treatment for cancer, identifying subjects suitable for a particular treatment, evaluating suitability of a patient for participating in a clinical trial, and/or predicting relapse in a subject. Accordingly, described herein are diagnostic and prognostic methods for cancer treatment based on sequencing data obtained using methods described herein. In some embodiments, a method to process RNA expression data as described herein comprises identifying a cancer treatment (also referred to herein as an anti-cancer therapy) for the subject using the bias-corrected gene expression data.

Molecular Functional Expression Signatures

In some embodiments, identifying a cancer treatment for a subject comprises characterizing the cancer or tumor in the subject using bias-corrected gene expression data. In some embodiments, a cancer in a subject is characterized by determining a molecular functional expression signature, which may include and/or reflect information relating to the molecular characteristics of a tumor including tumor genetics, pro-tumor microenvironment factors, and anti-tumor immune response factors.

A “molecular functional expression signature (MFES)”, as described herein, refers to information relating to molecular and cellular composition, and biological processes that are present within and/or surrounding the tumor. In some embodiments, the MFES of a patient includes gene expression levels for each of one or more groups of genes (“gene groups”). In some embodiments, the information in the MFES may be generated using gene expression data (e.g., bias-corrected gene expression data) for the gene groups obtained by sequencing normal and/or tumor tissue. Though other types of gene expression data may be used to generate an MFES, it should be appreciated that the inventors recognized that using bias-corrected gene expression data to generate a molecular functional expression signature allows the resulting MFES to more accurately and faithfully represent the molecular functional characteristics of the subject's tumor. In turn, applying an MFES determined from bias-corrected gene expression data to identifying a cancer therapy for the subject allows for the identification of more effective therapies, improved ability to determine whether one or more cancer therapies will be effective if administered to the subject, improved ability to identify clinical trials in which the subject may participate, and/or improvements to numerous other prognostic, diagnostic, and clinical applications.

Gene Groups

A “gene group” refers to a group of genes associated with molecular processes present within and/or surrounding a tumor. Examples of gene groups and techniques for determining gene group expression levels are described in International PCT Publication WO2018/231771, published on Dec. 20, 2018, entitled “Systems and Methods for Generating, Visualizing and Classifying Molecular Functional Profiles,” (being a publication of PCT Application No.: PCT/US2018/037017, filed Jun. 12, 2018), the entire contents of which are incorporated herein by reference. A “gene group” may be referred to herein as a “module”.

Exemplary modules may include, but are not limited to, Major histocompatibility complex I (MHC I) module, Major histocompatibility complex II (MHC II) module, Coactivation molecules module, Effector cells module, Effector T cell module, Natural killer cells (NK cells) module, T cell traffic module, T cells module, B cells module, B cell traffic module, Benign B cells module, Malignant B cell marker module, M1 signatures module, Th1 signature module, Antitumor cytokines module, Checkpoint inhibition (or checkpoint molecules) module, Follicular dendritic cells module, Follicular B helper T cells module, Protumor cytokines module, Regulatory T cells (Treg) module, Treg traffic module, Myeloid-derived suppressor cells (MDSCs) module, MDSC and TAM traffic module, Granulocytes module, Granulocytes traffic module, Eosinophil signature model, Neutrophil signature model, Mast cell signature module, M2 signature module, Th2 signature module, Th17 signature module, Protumor cytokines module, Complement inhibition module, Fibroblastic reticular cells module, Cancer associated fibroblasts (CAFs) module, Matrix formation (or Matrix) module, Angiogenesis module, Endothelium module, Hypoxia factors module, Coagulation module, Blood endothelium module, Lymphatic endothelium module, Proliferation rate (or Tumor proliferation rate) module, Oncogenes module, PI3K/AKT/mTOR signaling module, RAS/RAF/MEK signaling module, Receptor tyrosine kinases expression module, Growth Factors module, Tumor suppressors module, Metastasis signature module, Antimetastatic factors module, and Mutation status module.

In some embodiments, each of one or more gene groups in an MFES may comprise at least two genes (e.g., at least two genes, at least three genes, at least four genes, at least five genes, at least six genes, at least seven genes, at least eight genes, at least nine genes, at least ten genes, or more than ten genes as shown in the following lists; in some embodiments all of the listed genes are selected from each group; and in some embodiments the numbers of genes in each selected group are not the same).
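As a hedged illustration of the selection rule just described, the following sketch keeps, for each gene group, the listed genes that were actually measured, and rejects any group that falls below a minimum count. The function name select_group_genes, the min_genes parameter, and the example set of measured genes are hypothetical; the two gene lists are taken from this section.

```python
def select_group_genes(gene_groups, measured_genes, min_genes=2):
    """Pick the measured genes of every group, enforcing a per-group minimum.

    The number of genes kept may differ from group to group, which matches
    the embodiments in which the numbers of genes are not the same.
    """
    selected = {}
    for name, genes in gene_groups.items():
        present = [g for g in genes if g in measured_genes]
        if len(present) < min_genes:
            raise ValueError(
                f"{name}: {len(present)} gene(s) available, need {min_genes}"
            )
        selected[name] = present
    return selected


# Gene lists copied from this section.
groups = {
    "Coactivation molecules": [
        "CD80", "CD86", "CD40", "CD83", "TNFRSF4", "ICOSLG", "CD28",
    ],
    "Lymphatic endothelium": ["CCL21", "CXCL12"],
}

# Hypothetical set of genes with usable expression values.
measured = {"CD80", "CD86", "CD28", "CCL21", "CXCL12"}
selection = select_group_genes(groups, measured)
```

Raising an error for an under-populated group is one possible design choice; an implementation could equally drop such a group from the signature.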

In some embodiments, the modules in a molecular functional expression signature may comprise or consist of: Major histocompatibility complex I (MHC I) module, Major histocompatibility complex II (MHC II) module, Coactivation molecules module, Effector cells (or Effector T cell) module, Natural killer cells (NK cells) module, T cells module, B cells module, M1 signatures module, Th1 signature module, Antitumor cytokines module, Checkpoint inhibition (or checkpoint molecules) module, Regulatory T cells (Treg) module, Myeloid-derived suppressor cells (MDSCs) module, Neutrophil signature model, M2 signature module, Th2 signature module, Protumor cytokines module, Complement inhibition module, Cancer associated fibroblasts (CAFs) module, Angiogenesis module, Endothelium module, Proliferation rate (or Tumor proliferation rate) module, PI3K/AKT/mTOR signaling module, RAS/RAF/MEK signaling module, Receptor tyrosine kinases expression module, Growth Factors module, Tumor suppressors module, Metastasis signature module, and Antimetastatic factors module. The modules may additionally include: T cell traffic module, Antitumor cytokines module, Treg traffic module, MDSC and TAM traffic module, Granulocytes or Granulocyte traffic module, Eosinophil signature model, Mast cell signature module, Th17 signature module, Matrix formation (or Matrix) module, and Hypoxia factors module. Such an MFES may be determined for a subject having a solid cancer (e.g., a melanoma) and used, for example, to identify a therapy for treating the solid cancer.

In some embodiments, the modules in a molecular functional expression signature may comprise or consist of: Effector cells (or Effector T cell) module, Natural killer cells (NK cells) module, T cells module, Malignant B cell marker module, M1 signatures module, Th1 signature module, Checkpoint inhibition (or checkpoint molecules) module, Follicular dendritic cells module, Follicular B helper T cells module, Protumor cytokines module, Regulatory T cells (Treg) module, Neutrophil signature model, M2 signature module, Th2 signature module, Complement inhibition module, Fibroblastic reticular cells module, Angiogenesis module, Blood endothelium module, Proliferation rate (or Tumor proliferation rate) module, Oncogenes module, and Tumor suppressors module. The modules may additionally include: Major histocompatibility complex I (MHC I) module, Major histocompatibility complex II (MHC II) module, Coactivation molecules module, B cell traffic module, Benign B cells module, Antitumor cytokines module, Treg traffic module, Mast cell signature module, Th17 signature module, Matrix formation (or Matrix) module, Hypoxia factors module, Coagulation module, and Lymphatic endothelium module. Such an MFES may be determined for a subject having follicular lymphoma and used, for example, to identify a therapy for treating the follicular lymphoma.

In some embodiments, the gene groups in an MFES may comprise at leasttwo genes (e.g., at least two genes, at least three genes, at least fourgenes, at least five genes, at least six genes, at least seven genes, atleast eight genes, at least nine genes, at least ten genes, or more thanten genes as shown in the following lists; in some embodiments all ofthe listed genes are selected from each group; and in some embodimentsthe numbers of genes in each selected group are not the same): Majorhistocompatibility complex I (MHC I) module: HLA-A, HLA-B, HLA-C, B2M,TAP1, and TAP2; Major histocompatibility complex II (MHC II) module:HLA-DRA, HLA-DRB1, HLA-DOB, HLA-DPB2, HLA-DMA, HLA-DOA, HLA-DPA1,HLA-DPB1, HLA-DMB, HLA-DQB1, HLA-DQA1, HLA-DRB5, HLA-DQA2, HLA-DQB2, andHLA-DRB6; Coactivation molecules module: CD80, CD86, CD40, CD83,TNFRSF4, ICOSLG, CD28; Effector cells module: IFNG, GZMA, GZMB, PRF1,LCK, GZMK, ZAP70, GNLY, FASLG, TBX21, EOMES, CD8A, and CD8B; Effector Tcell module: IFNG, GZMA, GZMB, PRF1, LCK, GZMK, ZAP70, GNLY, FASLG,TBX21, EOMES, CD8A, and CD8B; Natural killer cells (NK cells) module:NKG7, CD160, CD244, NCR1, KLRC2, KLRK1, CD226, GZMH, GNLY, IFNG,KIR2DL4, KIR2DS1, KIR2DS2, KIR2DS3, KIR2DS4, KIR2DS5, EOMES, CLIC3,FGFBP2, KLRF1, and SH2D1B; T cell traffic module: CXCL9, CXCL10, CXCR3,CX3CL1, CCR7, CXCL11, CCL21, CCL2, CCL3, CCL4, and CCL5; T cells module:EOMES, TBX21, ITK, CD3D, CD3E, CD3G, TRAC, TRBC1, TRBC2, LCK, UBASH3A,TRAT1, CD5, and CD28; B cells module: CD19, MS4A1, TNFRSF13C, CD27,CD24, CR2, TNFRSF17, TNFRSF13B, CD22, CD79A, CD79B, BLK, FCRL5, PAX5,and STAP1; B cell traffic module: CXCL13 and CXCR5; Benign B cellsmodule: CD19, MS4A1, TNFRSF13C, CD27, CD24, CR2, TNFRSF17, TNFRSF13B,CD22, CD79A, CD79B, and BLK; Malignant B cell marker module: MME, CD70,CD20, CD22, and PAX5; M1 signatures module: NOS2, IL12A, IL12B, IL23A,TNF, IL1B, and SOCS3; Th1 signature module: IFNG, IL2, CD40LG, IL15,CD27, TBX21, LTA, and IL21; Antitumor cytokines module: HMGB1, 
TNF,IFNB1, IFNA2, CCL3, TNFSF10, and FASLG; Checkpoint inhibition (orcheckpoint molecules) module: PDCD1, CD274, CTLA4, LAG3, PDCD1LG2, BTLA,HAVCR2, and VSIR; Follicular dendritic cells module: CR1, FCGR2A,FCGR2B, FCGR2C, CR2, FCER2, CXCL13, MADCAM1, ICAM1, VCAM1, BST1, LTBR,and TNFRSF1A; Follicular B helper T cells module: CXCR5, B3GAT1, ICOS,CD40LG, CD84, IL21, BCL6, MAF, and SAP; Protumor cytokines module: IL10,TGFB1, TGFB2, TGFB3, IL22, MIF, TNFSF13B, IL6, and IL7; Regulatory Tcells (Treg) module: TGFB1, TGFB2, TGFB3, FOXP3, CTLA4, IL10, TNFRSF18,TNFR2, and TNFRSF1B; Treg traffic module: CCL17, CXCL12, CXCR4, CCR4,CCL22, CCL1, CCL2, CCL5, CXCL13, and CCL28; Myeloid-derived suppressorcells (MDSCs) module: IDO1, ARG1, IL4R, IL10, TGFB1, TGFB2, TGFB3, NOS2,CYBB, CXCR4, and CD33; MDSC and TAM traffic module: CXCL1, CXCL5, CCL2,CCL4, CCL8, CCR2, CCL3, CCL5, CSF1, and CXCL8; Granulocytes module:CXCL8, CXCL2, CXCL1, CCL11, CCL24, KITLG, CCL5, CXCL5, CCR3, CCL26,PRG2, EPX, RNASE2, RNASE3, IL5RA, GATA1, SIGLEC8, PRG3, MPO, ELANE,PRTN3, CTSG, FCGR3B, CXCR1, CXCR2, CD177, PI3, FFAR2, PGLYRP1, CMA1,TPSAB1, MS4A2, CPA3, IL4, IL5, IL13, and SIGLEC8; Granulocyte trafficmodule: CXCL8, CXCL2, CXCL1, CCL11, CCL24, KITLG, CCL5, CXCL5, CCR3, andCCL26; Eosinophil signature model: PRG2, EPX, RNASE2, RNASE3, IL5RA,GATA1, SIGLEC8, and PRG3; Neutrophil signature model: MPO, ELANE, PRTN3,CTSG, FCGR3B, CXCR1, CXCR2, CD177, PI3, FFAR2, and PGLYRP1; Mast cellsignature module: CMA1, TPSAB1, MS4A2, CPA3, IL4, IL5, IL13, andSIGLEC8; M2 signature module: IL10, VEGFA, TGFB1, IDO1, PTGES, MRC1,CSF1, LRP1, ARG1, PTGS1, MSR1, CD163, and CSF1R; Th2 signature module:IL4, IL5, IL13, IL10, IL25, and GATA3; Th17 signature module: IL17A,IL22, IL26, IL17F, IL21, and RORC; Protumor cytokines module: IL10,TGFB1, TGFB2, TGFB3, IL22, and MIF; Complement inhibition module: CFD,CFI, CD55, CD46, CR1, and CD59; Fibroblastic reticular cells module:DES, VIM, PDGFRA, PDPN, NT5E, THY1, ENG, ACTA2, LTBR, TNFRSF1A, 
VCAM1,ICAM1, and BST1; Cancer associated fibroblasts (CAFs) module: COL1A1,COL1A2, COL4A1, COL5A1, TGFB1, TGFB2, TGFB3, ACTA2, FGF2, FAP, LRP1,CD248, COL6A1, COL6A2, COL6A3, FBLN1, LUM, MFAP5, LGALS1, and PRELP;Matrix formation (or Matrix) module: MMP9, FN1, COL1A1, COL1A2, COL3A1,COL4A1, CA9, VTN, LGALS7, TIMP1, MMP2, MMP1, MMP3, MMP12, LGALS9, MMP7,and COL5A1; Angiogenesis module: VEGFA, VEGFB, VEGFC, PDGFC, CXCL8,CXCR2, FLT1, PIGF, CXCL5, KDR, ANGPT1, ANGPT2, TEK, VWF, CDH5, NOS3,VCAM1, MMRN1, LDHA, HIF1A, EPAS1, CA9, SPP1, LOX, SLC2A1, and LAMP3;Endothelium module: VEGFA, NOS3, KDR, FLT1, VCAM1, VWF, CDH5, MMRN1,CLEC14A, MMRN2, and ECSCR; Hypoxia factors module: LDHA, HIF1A, EPAS1,CA9, SPP1, LOX, SLC2A1, and LAMP3; Coagulation module: HPSE, SERPINE1,SERPINB2, F3, and ANXA2; Blood endothelium module: VEGFA, NOS3, KDR,FLT1, VCAM1, VWF, CDH5, and MMRN1; Lymphatic endothelium module: CCL21and CXCL12; Proliferation rate (or Tumor proliferation rate) module:MKI67, ESCO2, CETN3, CDK2, CCND1, CCNE1, AURKA, AURKB, E2F1, MYBL2,BUB1, PLK1, PRC1, CCNB1, MCM2, MCM6, CDK4, and CDK6; Oncogenes module:MDM2, MYC, AKT1, BCL2, MME, and SYK; PI3K/AKT/mTOR signaling module:PIK3CA, PIK3CB, PIK3CG, PIK3CD, AKT1, MTOR, PTEN, PRKCA, AKT2, and AKT3;RAS/RAF/MEK signaling module: BRAF, FNTA, FNTB, MAP2K1, MAP2K2, MKNK1,and MKNK2; Receptor tyrosine kinases expression module: ALK, AXL, KIT,EGFR, ERBB2, FLT3, MET, NTRK1, FGFR1, FGFR2, FGFR3, ERBB4, ERBB3,BCR-ABL, PDGFRA, PDGFRB, and ABL1; Growth Factors module: NGF, CSF3,CSF2, FGF7, IGF1, IGF2, IL7, and FGF2; Tumor suppressors module: TP53,MLL2, CREBBP, EP300, ARID1A, HIST1H1, EBF1, IRF4, IKZF3, KLHL6, PRDM1,CDKN2A, RB1, EPHA7, TNFAIP3, TNFRSF14, FAS, SHP1, SOCS1, SIK1, PTEN,DCN, MTAP, AIM2, and MITF; Metastasis signature module: ESRP1, HOXA1,SMARCA4, TWIST1, NEDD9, PAPPA, CTSL, SNAI2, and HPSE; Antimetastaticfactors module: NCAM1, CDH1, KISS1, BRMS1, ADGRG1, TCF21, PCDH10, andMITF; and Mutation status module: APC, ARID1A, ATM, ATRX, BAP1, 
BRAF, BRCA2, CDH1, CDKN2A, CTCF, CTNNB1, DNMT3A, EGFR, FBXW7, FLT3, GATA3, HRAS, IDH1, KRAS, MAP3K1, MTOR, NAV3, NCOR1, NF1, NOTCH1, NPM1, NRAS, PBRM1, PIK3CA, PIK3R1, PTEN, RB1, RUNX1, SETD2, STAG2, TAF1, TP53, and VHL. In certain embodiments, two or more genes from any combination of the listed modules may be used to generate a molecular functional expression signature (or a visualization thereof, termed an “MF PORTRAIT” herein) for a subject.

In some embodiments, the gene groups in an MFES may comprise at leasttwo genes (e.g., at least two genes, at least three genes, at least fourgenes, at least five genes, at least six genes, at least seven genes, atleast eight genes, at least nine genes, at least ten genes, or more thanten genes as shown in the following lists; in some embodiments all ofthe listed genes are selected from each group; and in some embodimentsthe numbers of genes in each selected group are not the same): Majorhistocompatibility complex I (MHC I) module: HLA-A, HLA-B, HLA-C, B2M,TAP1, and TAP2; Major histocompatibility complex II (MHC II) module;HLA-DRA, HLA-DRB1, HLA-DOB, HLA-DPB2, HLA-DMA, HLA-DOA, HLA-DPA1,HLA-DPB1, HLA-DMB, HLA-DQB1, HLA-DQA1, HLA-DRB5, HLA-DQA2, HLA-DQB2, andHLA-DRB6; Coactivation molecules module: CD80, CD86, CD40, CD83,TNFRSF4, ICOSLG, CD28; Effector cells (or Effector T cell) module: IFNG,GZMA, GZMB, PRF1, LCK, GZMK, ZAP70, GNLY, FASLG, TBX21, EOMES, CD8A, andCD8B; Natural killer cells (NK cells) module: NKG7, CD160, CD244, NCR1,KLRC2, KLRK1, CD226, GNLY, KIR2DL4, KIR2DS1, KIR2DS2, KIR2DS3, KIR2DS4,KIR2DS5, EOMES, CLIC3, FGFBP2, KLRF1, and SH2D1B; T cells module: TBX21,ITK, CD3D, CD3E, CD3G, TRAC, TRBC1, TRBC2, LCK, UBASH3A, TRAT1, CD5, andCD28; B cells module: CD19, MS4A1, TNFRSF13C, CD27, CD24, CR2, TNFRSF17,TNFRSF13B, CD22, CD79A, CD79B, BLK, FCRL5, PAX5, and STAP1; M1signatures module: NOS2, IL12A, IL12B, IL23A, TNF, IL1B, and SOCS3; Th1signature module: IFNG, IL2, CD40LG, IL15, CD27, TBX21, LTA, and IL21;Checkpoint inhibition (or checkpoint molecules) module: PDCD1, CD274,CTLA4, LAG3, PDCD1LG2, BTLA, HAVCR2, and VSIR; Regulatory T cells (Treg)module: TGFB1, TGFB2, TGFB3, FOXP3, CTLA4, IL10, and TNFRSF1B;Myeloid-derived suppressor cells (MDSCs) module: IDO1, ARG1, IL4R, IL10,TGFB1, TGFB2, TGFB3, NOS2, CYBB, CXCR4, and CD33; Neutrophil signaturemodel: MPO, ELANE, PRTN3, CTSG, FCGR3B, CXCR1, CXCR2, CD177, PI3, FFAR2,and PGLYRP1; M2 signature module: IL10, VEGFA, 
TGFB1, IDO1, PTGES, MRC1,CSF1, LRP1, ARG1, PTGS1, MSR1, CD163, and CSF1R; Th2 signature module:IL4, IL5, IL13, IL10, IL25, and GATA3; Protumor cytokines module: IL10,TGFB1, TGFB2, TGFB3, IL22, and MIF; Complement inhibition module: CFD,CFI, CD55, CD46, and CR1; Cancer associated fibroblasts (CAFs) module:COL1A1, COL1A2, COL4A1, COL5A1, TGFB1, TGFB2, TGFB3, ACTA2, FGF2, FAP,LRP1, CD248, COL6A1, COL6A2, COL6A3, FBLN1, LUM, MFAP5, and PRELP;Angiogenesis module: VEGFA, VEGFB, VEGFC, PDGFC, CXCL8, CXCR2, FLT1,PIGF, CXCL5, KDR, ANGPT1, ANGPT2, TEK, VWF, CDH5, NOS3, VCAM1, andMMRN1; Endothelium module: VEGFA, NOS3, KDR, FLT1, VCAM1, VWF, CDH5,MMRN1, CLEC14A, MMRN2, and ECSCR; Proliferation rate (or Tumorproliferation rate) module: MKI67, ESCO2, CETN3, CDK2, CCND1, CCNE1,AURKA, AURKB, E2F1, MYBL2, BUB1, PLK1, CCNB1, MCM2, MCM6, CDK4, andCDK6; PI3K/AKT/mTOR signaling module: PIK3CA, PIK3CB, PIK3CG, PIK3CD,AKT1, MTOR, PTEN, PRKCA, AKT2, and AKT3; RAS/RAF/MEK signaling module:BRAF, FNTA, FNTB, MAP2K1, MAP2K2, MKNK1, and MKNK2; Receptor tyrosinekinases expression module: ALK, AXL, KIT, EGFR, ERBB2, FLT3, MET, NTRK1,FGFR1, FGFR2, FGFR3, ERBB4, ERBB3, BCR-ABL, PDGFRA, PDGFRB, and ABL1;Growth Factors module: NGF, CSF3, CSF2, FGF7, IGF1, IGF2, IL7, and FGF2;Tumor suppressors module: TP53, SIK1, PTEN, DCN, MTAP, AIM2, RB1, andMITF; Metastasis signature module: ESRP1, HOXA1, SMARCA4, TWIST1, NEDD9,PAPPA, and HPSE; and Antimetastatic factors module: NCAM1, CDH1, KISS1,and BRMS1. 
In some embodiments, the gene groups may further comprise at least two genes (e.g., at least two genes, at least three genes, at least four genes, at least five genes, at least six genes, at least seven genes, at least eight genes, at least nine genes, at least ten genes, or more than ten genes as shown in the following lists; in some embodiments all of the listed genes are selected from each group; and in some embodiments the numbers of genes in each selected group are not the same): T cell traffic module: CXCL9, CXCL10, CXCR3, CX3CL1, CCR7, CXCL11, CCL21, CCL2, CCL3, CCL4, and CCL5; Antitumor cytokines module: HMGB1, TNF, IFNB1, IFNA2, CCL3, TNFSF10, and FASLG; Treg traffic module: CCL17, CXCL12, CXCR4, CCR4, CCL22, CCL1, CCL2, CCL5, CXCL13, and CCL28; MDSC and TAM traffic module: CXCL1, CXCL5, CCL2, CCL4, CCL8, CCR2, CCL3, CCL5, CSF1, and CXCL8; Granulocyte traffic module: CXCL8, CXCL2, CXCL1, CCL11, CCL24, KITLG, CCL5, CXCL5, CCR3, and CCL26; Eosinophil signature model: PRG2, EPX, RNASE2, RNASE3, IL5RA, GATA1, SIGLEC8, and PRG3; Mast cell signature module: CMA1, TPSAB1, MS4A2, CPA3, IL4, IL5, IL13, and SIGLEC8; Th17 signature module: IL17A, IL22, IL26, IL17F, IL21, and RORC; Matrix formation (or Matrix) module: FN1, CA9, MMP1, MMP3, MMP12, LGALS9, MMP7, MMP9, COL1A1, COL1A2, COL4A1, and COL5A1; and Hypoxia factors module: LDHA, HIF1A, EPAS1, CA9, SPP1, LOX, SLC2A1, and LAMP3. In certain embodiments, two or more genes from each of the listed modules are included. Any of the foregoing sets of modules may be used to generate an MFES (or a visualization thereof) for a subject with a solid cancer (e.g., melanoma).

In some embodiments, the gene groups may comprise at least two genes(e.g., at least two genes, at least three genes, at least four genes, atleast five genes, at least six genes, at least seven genes, at leasteight genes, at least nine genes, at least ten genes, or more than tengenes as shown in the following lists; in some embodiments all of thelisted genes are selected from each group; and in some embodiments thenumbers of genes in each selected group are not the same): Effector Tcell module: IFNG, GZMA, GZMB, PRF1, LCK, GZMK, ZAP70, GNLY, FASLG,TBX21, EOMES, CD8A, and CD8B; Natural killer cells (NK cells) module:NKG7, CD160, CD244, NCR1, KLRC2, KLRK1, CD226, GZMH, GNLY, IFNG,KIR2DL4, KIR2DS1, KIR2DS2, KIR2DS3, KIR2DS4, and KIR2DS5; T cellsmodule: EOMES, TBX21, ITK, CD3D, CD3E, CD3G, TRAC, TRBC1, TRBC2, LCK,UBASH3A, and TRAT1; Benign B cells module: CD19, MS4A1, TNFRSF13C, CD27,CD24, CR2, TNFRSF17, TNFRSF13B, CD22, CD79A, CD79B, and BLK; Malignant Bcell marker module: MME, CD70, CD20, CD22, and PAX5; M1 signaturesmodule: NOS2, IL12A, IL12B, IL23A, TNF, IL1B, and SOCS3; Th1 signaturemodule: IFNG, IL2, CD40LG, IL15, CD27, TBX21, LTA, and IL21; Checkpointinhibition (or checkpoint molecules) module: PDCD1, CD274, CTLA4, LAG3,PDCD1LG2, BTLA, and HAVCR2; Follicular dendritic cells module: CR1,FCGR2A, FCGR2B, FCGR2C, CR2, FCER2, CXCL13, MADCAM1, ICAM1, VCAM1, BST1,LTBR, and TNFRSF1A; Follicular B helper T cells module: CXCR5, B3GAT1,ICOS, CD40LG, CD84, IL21, BCL6, MAF, and SAP; Protumor cytokines module:IL10, TGFB1, TGFB2, TGFB3, IL22, MIF, TNFSF13B, IL6, and IL7; RegulatoryT cells (Treg) module: TGFB1, TGFB2, TGFB3, FOXP3, CTLA4, IL10,TNFRSF18, and TNFR2; Neutrophil signature model: MPO, ELANE, PRTN3, andCTSG; M2 signature module: IL10, VEGFA, TGFB1, IDO1, PTGES, MRC1, CSF1,LRP1, ARG1, PTGS1, MSR1, CD163, and CSF1R; Th2 signature module: IL4,IL5, IL13, IL10, IL25, and GATA3; Complement inhibition module: CFD,CFI, CD55, CD46, CR1, and CD59; Fibroblastic reticular cells 
module:DES, VIM, PDGFRA, PDPN, NT5E, THY1, ENG, ACTA2, LTBR, TNFRSF1A, VCAM1,ICAM1, and BST1; Angiogenesis module: VEGFA, VEGFB, VEGFC, PDGFC, CXCL8,CXCR2, FLT1, PIGF, CXCL5, KDR, ANGPT1, ANGPT2, TEK, VWF, and CDH5; Bloodendothelium module: VEGFA, NOS3, KDR, FLT1, VCAM1, VWF, CDH5, and MMRN1;Proliferation rate (or Tumor proliferation rate) module: MKI67, ESCO2,CETN3, CDK2, CCND1, CCNE1, AURKA, AURKB, E2F1, MYBL2, BUB1, PLK1, CCNB1,MCM2, and MCM6; Oncogenes module: MDM2, MYC, AKT1, BCL2, MME, and SYK;and Tumor suppressors module: TP53, MLL2, CREBBP, EP300, ARID1A,HIST1H1, EBF1, IRF4, IKZF3, KLHL6, PRDM1, CDKN2A, RB1, EPHA7, TNFAIP3,TNFRSF14, FAS, SHP1, and SOCS1. In some embodiments, the gene groups ofthe modules may further comprise at least two genes (e.g., at least twogenes, at least three genes, at least four genes, at least five genes,at least six genes, at least seven genes, at least eight genes, at leastnine genes, at least ten genes, or more than ten genes as shown in thefollowing lists; in some embodiments all of the listed genes areselected from each group; and in some embodiments the numbers of genesin each selected group are not the same): Coactivation molecules module:TNFRSF4 and CD28; B cell traffic module: CXCL13 and CXCR5; Antitumorcytokines module: HMGB1, TNF, IFNB1, IFNA2, CCL3, TNFSF10, FASLG; Tregtraffic module: CCL17, CCR4, CCL22, and CXCL13; Eosinophil signaturemodel: PRG2, EPX, RNASE2, RNASE3, IL5RA, GATA1, SIGLEC8, and PRG3; Mastcell signature module: CMA1, TPSAB1, MS4A2, CPA3, IL4, IL5, IL13, andSIGLEC8; Th17 signature module: IL17A, IL22, IL26, IL17F, IL21, andRORC; Matrix formation (or Matrix) module: MMP9, FN1, COL1A1, COL1A2,COL3A1, COL4A1, CA9, VTN, LGALS7, TIMP1, and MMP2; Hypoxia factorsmodule: LDHA, HIF1A, EPAS1, CA9, SPP1, LOX, SLC2A1, and LAMP3;Coagulation module: HPSE, SERPINE1, SERPINB2, F3, and ANXA2; andLymphatic endothelium module: CCL21 and CXCL12. 
In certain embodiments, two or more genes from each of the listed modules are included. Any of the foregoing sets of modules may be used to generate an MFES (or a visualization thereof) for a subject with a follicular lymphoma.

In some embodiments, an MFES may include one or more gene groups associated with cancer malignancy and one or more gene groups associated with the cancer microenvironment. In some embodiments, the gene group(s) associated with cancer malignancy include the tumor properties gene group. In some embodiments, the gene group(s) associated with cancer microenvironment include the tumor-promoting immune microenvironment gene group, the anti-tumor immune microenvironment gene group, the angiogenesis gene group, and the fibroblasts gene group.

In some embodiments, the gene groups associated with cancer malignancycomprises at least three genes from the following group (e.g., at leastthree genes, at least four genes, at least five genes, at least sixgenes, at least seven genes, at least eight genes, at least nine genes,at least ten genes, or more than ten genes are selected from each group;in some embodiments all of the listed genes are selected from eachgroup): the tumor properties group: MKI67, ESCO2, CETN3, CDK2, CCND1,CCNE1, AURKA, AURKB, CDK4, CDK6, PRC1, E2F1, MYBL2, BUB1, PLK1, CCNB1,MCM2, MCM6, PIK3CA, PIK3CB, PIK3CG, PIK3CD, AKT1, MTOR, PTEN, PRKCA,AKT2, AKT3, BRAF, FNTA, FNTB, MAP2K1, MAP2K2, MKNK1, MKNK2, ALK, AXL,KIT, EGFR, ERBB2, FLT3, MET, NTRK1, FGFR1, FGFR2, FGFR3, ERBB4, ERBB3,BCR-ABL, PDGFRA, PDGFRB, NGF, CSF3, CSF2, FGF7, IGF1, IGF2, IL7, FGF2,TP53, SIK1, PTEN, DCN, MTAP, AIM2, RB1, ESRP1, CTSL, HOXA1, SMARCA4,SNAI2, TWIST1, NEDD9, PAPPA, HPSE, KISS1, ADGRG1, BRMS1, TCF21, CDH1,PCDH10, NCAM1, MITF, APC, ARID1A, ATM, ATRX, BAP1, BRAF, BRCA2, CDH1,CDKN2A, CTCF, CTNNB1, DNMT3A, EGFR, FBXW7, FLT3, GATA3, HRAS, IDH1,KRAS, MAP3K1, MTOR, NAV3, NCOR1, NF1, NOTCH1, NPM1, NRAS, PBRM1, PIK3CA,PIK3R1, PTEN, RB1, RUNX1, SETD2, STAG2, TAF1, TP53, and VHL.
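Because the tumor properties group combines genes from several narrower groups, some gene symbols (for example PTEN and BRAF) appear more than once in the list above. One possible way to handle this when building a working gene set is an order-preserving de-duplication, sketched below. The helper name unique_genes is hypothetical, and the input is an abbreviated excerpt of the list, not the full group.

```python
def unique_genes(genes):
    """Drop repeated gene symbols while keeping first-occurrence order."""
    # dict preserves insertion order, so fromkeys() keeps the first
    # occurrence of each symbol and discards later repeats.
    return list(dict.fromkeys(genes))


# Abbreviated excerpt of the tumor properties group (not the full list);
# PTEN and BRAF each recur, as they do in the full list.
excerpt = [
    "AKT1", "MTOR", "PTEN", "BRAF", "TP53",
    "SIK1", "PTEN", "BAP1", "BRAF",
]
deduplicated = unique_genes(excerpt)
```

Whether repeated symbols should be collapsed, or counted once per contributing group, is a design decision that this sketch does not settle.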

In some embodiments, the gene groups associated with cancer microenvironment include at least three genes from each of the following groups (e.g., at least three genes, at least four genes, at least five genes, at least six genes, at least seven genes, at least eight genes, at least nine genes, at least ten genes, or more than ten genes are selected from each group; in some embodiments all of the listed genes are selected from each group): the anti-tumor immune microenvironment group: HLA-A, HLA-B, HLA-C, B2M, TAP1, TAP2, HLA-DRA, HLA-DRB1, HLA-DOB, HLA-DPB2, HLA-DMA, HLA-DOA, HLA-DPA1, HLA-DPB1, HLA-DMB, HLA-DQB1, HLA-DQA1, HLA-DRB5, HLA-DQA2, HLA-DQB2, HLA-DRB6, CD80, CD86, CD40, CD83, TNFRSF4, ICOSLG, CD28, IFNG, GZMA, GZMB, PRF1, LCK, GZMK, ZAP70, GNLY, FASLG, TBX21, EOMES, CD8A, CD8B, NKG7, CD160, CD244, NCR1, KLRC2, KLRK1, CD226, GZMH, GNLY, IFNG, KIR2DL4, KIR2DS1, KIR2DS2, KIR2DS3, KIR2DS4, KIR2DS5, CXCL9, CXCL10, CXCR3, CX3CL1, CCR7, CXCL11, CCL21, CCL2, CCL3, CCL4, CCL5, EOMES, TBX21, ITK, CD3D, CD3E, CD3G, TRAC, TRBC1, TRBC2, LCK, UBASH3A, TRAT1, CD19, MS4A1, TNFRSF13C, CD27, CD24, CR2, TNFRSF17, TNFRSF13B, CD22, CD79A, CD79B, BLK, NOS2, IL12A, IL12B, IL23A, TNF, IL1B, SOCS3, IFNG, IL2, CD40LG, IL15, CD27, TBX21, LTA, IL21, HMGB1, TNF, IFNB1, IFNA2, CCL3, TNFSF10, and FASLG; the tumor-promoting immune microenvironment group: PDCD1, CD274, CTLA4, LAG3, PDCD1LG2, BTLA, HAVCR2, VSIR, CXCL12, TGFB1, TGFB2, TGFB3, FOXP3, CTLA4, IL10, TNFRSF1B, CCL17, CXCR4, CCR4, CCL22, CCL1, CCL2, CCL5, CXCL13, CCL28, IDO1, ARG1, IL4R, IL10, TGFB1, TGFB2, TGFB3, NOS2, CYBB, CXCR4, CD33, CXCL1, CXCL5, CCL2, CCL4, CCL8, CCR2, CCL3, CCL5, CSF1, CXCL8, CXCL8, CXCL2, CXCL1, CCL11, CCL24, KITLG, CCL5, CXCL5, CCR3, CCL26, PRG2, EPX, RNASE2, RNASE3, IL5RA, GATA1, SIGLEC8, PRG3, CMA1, TPSAB1, MS4A2, CPA3, IL4, IL5, IL13, SIGLEC8, MPO, ELANE, PRTN3, CTSG, IL10, VEGFA, TGFB1, IDO1, PTGES, MRC1, CSF1, LRP1, ARG1, PTGS1, MSR1, CD163, CSF1R, IL4, IL5, IL13, IL10, IL25, GATA3, IL10, TGFB1, TGFB2, TGFB3, IL22, MIF, CFD, CFI, CD55, CD46, and CR1; the fibroblasts group: LGALS1, COL1A1, COL1A2, COL4A1, COL5A1, TGFB1, TGFB2, TGFB3, ACTA2, FGF2, FAP, LRP1, CD248, COL6A1, COL6A2, and COL6A3; and the angiogenesis group: VEGFA, VEGFB, VEGFC, PDGFC, CXCL8, CXCR2, FLT1, PIGF, CXCL5, KDR, ANGPT1, ANGPT2, TEK, VWF, CDH5, NOS3, KDR, VCAM1, MMRN1, LDHA, HIF1A, EPAS1, CA9, SPP1, LOX, SLC2A1, and LAMP3. In some embodiments, an unequal number of genes may be selected from each of the listed groups for use. In specific embodiments, all or almost all of the listed genes are used.

In some embodiments, gene groups associated with cancer malignancy are: the proliferation rate group, the PI3K/AKT/mTOR signaling group, the RAS/RAF/MEK signaling group, the receptor tyrosine kinases expression group, the tumor suppressors group, the metastasis signature group, the anti-metastatic factors group, and the mutation status group. In some embodiments, the gene groups associated with cancer microenvironment are: the cancer associated fibroblasts group, the angiogenesis group, the antigen presentation group, the cytotoxic T and NK cells group, the B cells group, the anti-tumor microenvironment group, the checkpoint inhibition group, the Treg group, the MDSC group, the granulocytes group, and the tumor-promotive immune group.
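The two-level organization described above (categories containing gene groups) can be sketched as a simple lookup structure. This is a minimal sketch: the group names are taken from the preceding paragraph, while the variable name MFES_CATEGORIES and the helper category_of are hypothetical and introduced only for illustration.

```python
# Category -> gene group names, as enumerated in the preceding paragraph.
MFES_CATEGORIES = {
    "cancer malignancy": [
        "proliferation rate",
        "PI3K/AKT/mTOR signaling",
        "RAS/RAF/MEK signaling",
        "receptor tyrosine kinases expression",
        "tumor suppressors",
        "metastasis signature",
        "anti-metastatic factors",
        "mutation status",
    ],
    "cancer microenvironment": [
        "cancer associated fibroblasts",
        "angiogenesis",
        "antigen presentation",
        "cytotoxic T and NK cells",
        "B cells",
        "anti-tumor microenvironment",
        "checkpoint inhibition",
        "Treg",
        "MDSC",
        "granulocytes",
        "tumor-promotive immune",
    ],
}


def category_of(group_name, categories=MFES_CATEGORIES):
    """Return the category that contains the given gene group name."""
    for category, group_names in categories.items():
        if group_name in group_names:
            return category
    raise KeyError(f"unknown gene group: {group_name}")


treg_category = category_of("Treg")
```

Other embodiments in this section divide the same categories into fewer or more gene groups; the same structure would apply with a different set of names.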

In some embodiments, the gene groups associated with cancer malignancycomprises at least three genes from each of the following groups (e.g.,at least three genes, at least four genes, at least five genes, at leastsix genes, at least seven genes, at least eight genes, at least ninegenes, at least ten genes, or more than ten genes are selected from eachgroup): the proliferation rate group: MKI67, ESCO2, CETN3, CDK2, CCND1,CCNE1, AURKA, AURKB, CDK4, CDK6, PRC1, E2F1, MYBL2, BUB1, PLK1, CCNB1,MCM2, and MCM6; the PI3K/AKT/mTOR signaling group: PIK3CA, PIK3CB,PIK3CG, PIK3CD, AKT1, MTOR, PTEN, PRKCA, AKT2, and AKT3; the RAS/RAF/MEKsignaling group: BRAF, FNTA, FNTB, MAP2K1, MAP2K2, MKNK1, and MKNK2; thereceptor tyrosine kinases expression group: ALK, AXL, KIT, EGFR, ERBB2,FLT3, MET, NTRK1, FGFR1, FGFR2, FGFR3, ERBB4, ERBB3, BCR-ABL, PDGFRA,and PDGFRB; the tumor suppressors group: TP53, SIK1, PTEN, DCN, MTAP,AIM2, and RB1; the metastasis signature group: ESRP1, CTSL, HOXA1,SMARCA4, SNAI2, TWIST1, NEDD9, PAPPA, and HPSE; the anti-metastaticfactors group: KISS1, ADGRG1, BRMS1, TCF21, CDH1, PCDH10, NCAM1, andMITF; and the mutation status group: APC, ARID1A, ATM, ATRX, BAP1, BRAF,BRCA2, CDH1, CDKN2A, CTCF, CTNNB1, DNMT3A, EGFR, FBXW7, FLT3, GATA3,HRAS, IDH1, KRAS, MAP3K1, MTOR, NAV3, NCOR1, NF1, NOTCH1, NPM1, NRAS,PBRM1, PIK3CA, PIK3R1, PTEN, RB1, RUNX1, SETD2, STAG2, TAF1, TP53, andVHL.

In some embodiments, the gene groups associated with cancermicroenvironment comprises at least three genes from each of thefollowing groups (e.g., at least three genes, at least four genes, atleast five genes, at least six genes, at least seven genes, at leasteight genes, at least nine genes, at least ten genes, or more than tengenes are selected from each group): the cancer associated fibroblastsgroup: LGALS1, COL1A1, COL1A2, COL4A1, COL5A1, TGFB1, TGFB2, TGFB3,ACTA2, FGF2, FAP, LRP1, CD248, COL6A1, COL6A2, and COL6A3; theangiogenesis group: VEGFA, VEGFB, VEGFC, PDGFC, CXCL8, CXCR2, FLT1,PIGF, CXCL5, KDR, ANGPT1, ANGPT2, TEK, VWF, CDH5, NOS3, KDR, VCAM1,MMRN1, LDHA, HIF1A, EPAS1, CA9, SPP1, LOX, SLC2A1, and LAMP3; theantigen presentation group: HLA-A, HLA-B, HLA-C, B2M, TAP1, TAP2,HLA-DRA, HLA-DRB1, HLA-DOB, HLA-DPB2, HLA-DMA, HLA-DOA, HLA-DPA1,HLA-DPB1, HLA-DMB, HLA-DQB1, HLA-DQA1, HLA-DRB5, HLA-DQA2, HLA-DQB2,HLA-DRB6, CD80, CD86, CD40, CD83, TNFRSF4, ICOSLG, and CD28; thecytotoxic T and NK cells group: IFNG, GZMA, GZMB, PRF1, LCK, GZMK,ZAP70, GNLY, FASLG, TBX21, EOMES, CD8A, CD8B, NKG7, CD160, CD244, NCR1,KLRC2, KLRK1, CD226, GZMH, GNLY, IFNG, KIR2DL4, KIR2DS1, KIR2DS2,KIR2DS3, KIR2DS4, KIR2DS5, CXCL9, CXCL10, CXCR3, CX3CL1, CCR7, CXCL11,CCL21, CCL2, CCL3, CCL4, CCL5, EOMES, TBX21, ITK, CD3D, CD3E, CD3G,TRAC, TRBC1, TRBC2, LCK, UBASH3A, and TRAT1; the B cells group: CD19,MS4A1, TNFRSF13C, CD27, CD24, CR2, TNFRSF17, TNFRSF13B, CD22, CD79A,CD79B, and BLK; the anti-tumor microenvironment group: NOS2, IL12A,IL12B, IL23A, TNF, IL1B, SOCS3, IFNG, IL2, CD40LG, IL15, CD27, TBX21,LTA, IL21, HMGB1, TNF, IFNB1, IFNA2, CCL3, TNFSF10, and FASLG; thecheckpoint inhibition group: PDCD1, CD274, CTLA4, LAG3, PDCD1LG2, BTLA,HAVCR2, and VSIR; the Treg group: CXCL12, TGFB1, TGFB2, TGFB3, FOXP3,CTLA4, IL10, TNFRSF1B, CCL17, CXCR4, CCR4, CCL22, CCL1, CCL2, CCL5,CXCL13, and CCL28; the MDSC group: IDO1, ARG1, IL4R, IL10, TGFB1, TGFB2,TGFB3, NOS2, CYBB, CXCR4, CD33, CXCL1, CXCL5, CCL2, 
CCL4, CCL8, CCR2, CCL3, CCL5, CSF1, and CXCL8; the granulocytes group: CXCL8, CXCL2, CXCL1, CCL11, CCL24, KITLG, CCL5, CXCL5, CCR3, CCL26, PRG2, EPX, RNASE2, RNASE3, IL5RA, GATA1, SIGLEC8, PRG3, CMA1, TPSAB1, MS4A2, CPA3, IL4, IL5, IL13, SIGLEC8, MPO, ELANE, PRTN3, and CTSG; and the tumor-promotive immune group: IL10, VEGFA, TGFB1, IDO1, PTGES, MRC1, CSF1, LRP1, ARG1, PTGS1, MSR1, CD163, CSF1R, IL4, IL5, IL13, IL10, IL25, GATA3, IL10, TGFB1, TGFB2, TGFB3, IL22, MIF, CFD, CFI, CD55, CD46, and CR1. In some embodiments, an unequal number of genes may be selected from each of the listed groups for use. In specific embodiments, all or almost all of the listed genes are used.

In some embodiments, the gene groups associated with cancer malignancy are: the proliferation rate group, the PI3K/AKT/mTOR signaling group, the RAS/RAF/MEK signaling group, the receptor tyrosine kinases expression group, the growth factors group, the tumor suppressors group, the metastasis signature group, the anti-metastatic factors group, and the mutation status group. In some embodiments, the plurality of gene groups associated with cancer microenvironment are: the cancer associated fibroblasts group, the angiogenesis group, the MHCI group, the MHCII group, the coactivation molecules group, the effector cells group, the NK cells group, the T cell traffic group, the T cells group, the B cells group, the M1 signatures group, the Th1 signature group, the antitumor cytokines group, the checkpoint inhibition group, the Treg group, the MDSC group, the granulocytes group, the M2 signature group, the Th2 signature group, the protumor cytokines group, and the complement inhibition group.

In some embodiments, the gene groups associated with cancer malignancycomprises at least three genes from each of the following groups (e.g.,at least three genes, at least four genes, at least five genes, at leastsix genes, at least seven genes, at least eight genes, at least ninegenes, at least ten genes, or more than ten genes are selected from eachgroup): the proliferation rate group: MKI67, ESCO2, CETN3, CDK2, CCND1,CCNE1, AURKA, AURKB, CDK4, CDK6, PRC1, E2F1, MYBL2, BUB1, PLK1, CCNB1,MCM2, and MCM6; the PI3K/AKT/mTOR signaling group: PIK3CA, PIK3CB,PIK3CG, PIK3CD, AKT1, MTOR, PTEN, PRKCA, AKT2, and AKT3; the RAS/RAF/MEKsignaling group: BRAF, FNTA, FNTB, MAP2K1, MAP2K2, MKNK1, and MKNK2; thereceptor tyrosine kinases expression group: ALK, AXL, KIT, EGFR, ERBB2,FLT3, MET, NTRK1, FGFR1, FGFR2, FGFR3, ERBB4, ERBB3, BCR-ABL, PDGFRA,and PDGFRB; the growth factors group: NGF, CSF3, CSF2, FGF7, IGF1, IGF2,IL7, and FGF2; the tumor suppressors group: TP53, SIK1, PTEN, DCN, MTAP,AIM2, and RB1; the metastasis signature group: ESRP1, CTSL, HOXA1,SMARCA4, SNAI2, TWIST1, NEDD9, PAPPA, and HPSE; the anti-metastaticfactors group: KISS1, ADGRG1, BRMS1, TCF21, CDH1, PCDH10, NCAM1, andMITF; and the mutation status group: APC, ARID1A, ATM, ATRX, BAP1, BRAF,BRCA2, CDH1, CDKN2A, CTCF, CTNNB1, DNMT3A, EGFR, FBXW7, FLT3, GATA3,HRAS, IDH1, KRAS, MAP3K1, MTOR, NAV3, NCOR1, NF1, NOTCH1, NPM1, NRAS,PBRM1, PIK3CA, PIK3R1, PTEN, RB1, RUNX1, SETD2, STAG2, TAF1, TP53, andVHL. 
In some embodiments, the plurality of gene groups associated with cancer microenvironment comprises at least three genes from each of the following groups: the cancer associated fibroblasts group: LGALS1, COL1A1, COL1A2, COL4A1, COL5A1, TGFB1, TGFB2, TGFB3, ACTA2, FGF2, FAP, LRP1, CD248, COL6A1, COL6A2, and COL6A3; the angiogenesis group: VEGFA, VEGFB, VEGFC, PDGFC, CXCL8, CXCR2, FLT1, PIGF, CXCL5, KDR, ANGPT1, ANGPT2, TEK, VWF, CDH5, NOS3, VCAM1, MMRN1, LDHA, HIF1A, EPAS1, CA9, SPP1, LOX, SLC2A1, and LAMP3; the MHCI group: HLA-A, HLA-B, HLA-C, B2M, TAP1, and TAP2; the MHCII group: HLA-DRA, HLA-DRB1, HLA-DOB, HLA-DPB2, HLA-DMA, HLA-DOA, HLA-DPA1, HLA-DPB1, HLA-DMB, HLA-DQB1, HLA-DQA1, HLA-DRB5, HLA-DQA2, HLA-DQB2, and HLA-DRB6; the coactivation molecules group: CD80, CD86, CD40, CD83, TNFRSF4, ICOSLG, and CD28; the effector cells group: IFNG, GZMA, GZMB, PRF1, LCK, GZMK, ZAP70, GNLY, FASLG, TBX21, EOMES, CD8A, and CD8B; the NK cells group: NKG7, CD160, CD244, NCR1, KLRC2, KLRK1, CD226, GZMH, GNLY, IFNG, KIR2DL4, KIR2DS1, KIR2DS2, KIR2DS3, KIR2DS4, and KIR2DS5; the T cell traffic group: CXCL9, CXCL10, CXCR3, CX3CL1, CCR7, CXCL11, CCL21, CCL2, CCL3, CCL4, and CCL5; the T cells group: EOMES, TBX21, ITK, CD3D, CD3E, CD3G, TRAC, TRBC1, TRBC2, LCK, UBASH3A, and TRAT1; the B cells group: CD19, MS4A1, TNFRSF13C, CD27, CD24, CR2, TNFRSF17, TNFRSF13B, CD22, CD79A, CD79B, and BLK; the M1 signatures group: NOS2, IL12A, IL12B, IL23A, TNF, IL1B, and SOCS3; the Th1 signature group: IFNG, IL2, CD40LG, IL15, CD27, TBX21, LTA, and IL21; the antitumor cytokines group: HMGB1, TNF, IFNB1, IFNA2, CCL3, TNFSF10, and FASLG; the checkpoint inhibition group: PDCD1, CD274, CTLA4, LAG3, PDCD1LG2, BTLA, HAVCR2, and VSIR; the Treg group: CXCL12, TGFB1, TGFB2, TGFB3, FOXP3, CTLA4, IL10, TNFRSF1B, CCL17, CXCR4, CCR4, CCL22, CCL1, CCL2, CCL5, CXCL13, and CCL28; the MDSC group: IDO1, ARG1, IL4R, IL10, TGFB1, TGFB2, TGFB3, NOS2, CYBB, CXCR4, CD33, CXCL1, CXCL5, CCL2, CCL4, CCL8, CCR2, CCL3, CCL5, CSF1, and CXCL8; the granulocytes group: CXCL8, CXCL2, CXCL1, CCL11, CCL24, KITLG, CCL5, CXCL5, CCR3, CCL26, PRG2, EPX, RNASE2, RNASE3, IL5RA, GATA1, SIGLEC8, PRG3, CMA1, TPSAB1, MS4A2, CPA3, IL4, IL5, IL13, MPO, ELANE, PRTN3, and CTSG; the M2 signature group: IL10, VEGFA, TGFB1, IDO1, PTGES, MRC1, CSF1, LRP1, ARG1, PTGS1, MSR1, CD163, and CSF1R; the Th2 signature group: IL4, IL5, IL13, IL10, IL25, and GATA3; the protumor cytokines group: IL10, TGFB1, TGFB2, TGFB3, IL22, and MIF; and the complement inhibition group: CFD, CFI, CD55, CD46, and CR1. In some embodiments, an unequal number of genes may be selected from each of the listed groups for use. In specific embodiments, all or almost all of the listed genes are used.
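
As an illustration of the "at least three genes from each group" condition, the following minimal Python sketch checks a candidate gene panel against a small excerpt of the gene groups listed above. The function names, the `GENE_GROUPS` dictionary, and the example panel are hypothetical and are not part of the described methods; only three of the listed groups are reproduced to keep the example short.

```python
from typing import Dict, Iterable, List

# Excerpt of three of the gene groups listed in the text (not the full set).
GENE_GROUPS: Dict[str, List[str]] = {
    "MHCI": ["HLA-A", "HLA-B", "HLA-C", "B2M", "TAP1", "TAP2"],
    "Th2 signature": ["IL4", "IL5", "IL13", "IL10", "IL25", "GATA3"],
    "complement inhibition": ["CFD", "CFI", "CD55", "CD46", "CR1"],
}


def genes_per_group(
    panel: Iterable[str], groups: Dict[str, List[str]] = GENE_GROUPS
) -> Dict[str, List[str]]:
    """Return, for each group, the panel genes that belong to that group."""
    panel_set = set(panel)
    return {
        name: [gene for gene in genes if gene in panel_set]
        for name, genes in groups.items()
    }


def groups_below_minimum(
    panel: Iterable[str],
    min_genes: int = 3,  # "at least three genes" in the text; may be raised
    groups: Dict[str, List[str]] = GENE_GROUPS,
) -> List[str]:
    """List the groups from which fewer than min_genes genes were selected."""
    selected = genes_per_group(panel, groups)
    return [name for name, genes in selected.items() if len(genes) < min_genes]
```

For a hypothetical panel containing HLA-A, HLA-B, B2M, IL4, IL5, IL13, CFD, and CFI, `groups_below_minimum` reports only the complement inhibition group, because only two of its genes are present. Raising `min_genes` corresponds to the "at least four genes", "at least five genes", and similar variants.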

A molecular functional expression signature may include any suitable number of gene groups. In some embodiments, an MFES comprises at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least 21, at least 22, at least 23, at least 24, at least 25, at least 26, at least 27, or at least 28 modules. In some embodiments, an MFES comprises up to 2, up to 3, up to 4, up to 5, up to 6, up to 7, up to 8, up to 9, up to 10, up to 11, up to 12, up to 13, up to 14, up to 15, up to 16, up to 17, up to 18, up to 19, up to 20, up to 21, up to 22, up to 23, up to 24, up to 25, up to 26, up to 27, or up to 28 gene groups.

Tumor Microenvironment Types

The inventors have recognized that a molecular functional expression signature for a subject having, suspected of having, or at risk of having cancer may provide valuable information about the microenvironment of the subject's cancer. The inventors have recognized that a subject's MFES may be used to classify the subject's microenvironment as being one of multiple types. For example, in some embodiments, the MFES may be used to classify the subject's microenvironment as being one of four different types of microenvironment (e.g., “1st MF profile” or “type A” microenvironment, “2nd MF profile” or “type B” microenvironment, “3rd MF profile” or “type C” microenvironment, “4th MF profile” or “type D” microenvironment, which are described in International PCT Publication WO2018/231771, which is incorporated by reference herein in its entirety). In turn, the identified microenvironment type may be used to identify a cancer therapy and/or determine the effectiveness (or lack thereof) of one or more cancer therapies. Examples of identifying cancer therapies based on a type of cancer microenvironment (e.g., determined from gene group expression data, for example, part of a molecular functional expression signature or a molecular functional profile) are described in International PCT Publication WO2018/231771.

First MF profile cancers may also be described as “inflamed/vascularized” and/or “inflamed/fibroblast-enriched”; Second MF profile cancers may also be described as “inflamed/non-vascularized” and/or “inflamed/non-fibroblast-enriched”; Third MF profile cancers may also be described as “non-inflamed/vascularized” and/or “non-inflamed/fibroblast-enriched”; and Fourth MF profile cancers may also be described as “non-inflamed/non-vascularized” and/or “non-inflamed/non-fibroblast-enriched” and/or “immune desert.”

In some embodiments, “inflamed” refers to the level of compositions and processes related to inflammation in a cancer (e.g., a tumor). In some embodiments, inflamed cancers (e.g., tumors) are highly infiltrated by immune cells, and are highly active with regard to antigen presentation and T-cell activation. In some embodiments, “vascularized” refers to the formation of blood vessels in a cancer (e.g., a tumor). In some embodiments, vascularized cancers (e.g., tumors) comprise high levels of cellular compositions and processes related to blood vessel formation. In some embodiments, “fibroblast enriched” refers to the level or amount of fibroblasts in a cancer (e.g., a tumor). In some embodiments, fibroblast enriched tumors comprise high levels of fibroblast cells.
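
The correspondence between the four MF profiles and the descriptors above can be sketched as a simple lookup. The following Python example is illustrative only: it assumes that the “inflamed” and “vascularized” status of a tumor have already been determined by some upstream analysis (the text does not specify thresholds), it uses only the vascularization axis rather than the alternative fibroblast-enrichment descriptors, and the names `MFProfile` and `describe_mf_profile` are hypothetical.

```python
from typing import NamedTuple


class MFProfile(NamedTuple):
    ordinal: str      # e.g., "1st MF profile"
    letter: str       # e.g., "type A"
    description: str  # e.g., "inflamed/vascularized"


def describe_mf_profile(inflamed: bool, vascularized: bool) -> MFProfile:
    """Map two assumed boolean calls onto the four profiles named in the text."""
    table = {
        (True, True): MFProfile(
            "1st MF profile", "type A", "inflamed/vascularized"),
        (True, False): MFProfile(
            "2nd MF profile", "type B", "inflamed/non-vascularized"),
        (False, True): MFProfile(
            "3rd MF profile", "type C", "non-inflamed/vascularized"),
        (False, False): MFProfile(
            "4th MF profile", "type D", "non-inflamed/non-vascularized"),
    }
    return table[(inflamed, vascularized)]
```

For example, `describe_mf_profile(False, False)` returns the fourth profile, which the text also describes as “immune desert.”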

Predicting Therapy Response

In some embodiments, sequencing data obtained using systems and methods described herein (e.g., bias-corrected gene expression data, data processed using the quality control techniques described herein, etc.) may be used for identifying subjects suitable for a particular treatment, and/or predicting likelihood of a patient's response or lack thereof to a particular treatment and/or predicting whether a patient may or may not have one or more adverse reactions to a particular therapy as described in International PCT Publication WO2018/231771, published on Dec. 20, 2018, entitled “Systems and Methods for Generating, Visualizing and Classifying Molecular Functional Profiles,” (being a publication of PCT Application No.: PCT/US2018/037017, filed Jun. 12, 2018), the entire contents of which are incorporated herein by reference.

In some embodiments, sequencing data obtained as described herein (e.g., bias-corrected gene expression data, data processed using the quality control techniques described herein, etc.) is useful for identifying a subject suitable for a particular treatment. In some embodiments, sequencing data (e.g., bias-corrected gene expression data, data processed using the quality control techniques described herein, etc.) obtained as described herein is useful for predicting likelihood of a patient's response or lack thereof to a particular treatment. In some embodiments, sequencing data obtained as described herein (e.g., bias-corrected gene expression data, data processed using the quality control techniques described herein, etc.) is useful for predicting whether a patient may or may not have one or more adverse reactions to a particular therapy.

In some embodiments, predicted efficacy of an immune checkpoint blockade therapy may be determined using sequencing data obtained as described herein (e.g., bias-corrected gene expression data, data processed using the quality control techniques described herein, etc.) as described in International PCT Publication WO2018/231772, published on Dec. 20, 2018, entitled “Systems and Methods for Identifying Responders and Non-Responders to Immune Checkpoint Blockade Therapy” (being a publication of International patent application number PCT/US2018/037018, filed Jun. 12, 2018), the entire contents of which are incorporated herein by reference.

In some embodiments, sequencing data obtained as described herein (e.g., bias-corrected gene expression data, data processed using the quality control techniques described herein, etc.) is useful for determining a biomarker, a biomarker score, a normalized biomarker score, a therapy score, and/or an impact score as described in International PCT Publication WO2018/231762, published on Dec. 20, 2018, entitled “Systems and Methods for Identifying Cancer Treatments from Normalized Biomarker Scores” (being a publication of International patent application number PCT/US2018/037008, filed Jun. 12, 2018), the entire contents of which are incorporated herein by reference.

Methods of Treatment

In certain methods described herein, an effective amount of anti-cancer therapy described herein may be administered or recommended for administration to a subject (e.g., a human) in need of the treatment via a suitable route (e.g., intravenous administration).

The subject to be treated by the methods described herein may be a human patient having, suspected of having, or at risk for a cancer. Examples of a cancer include, but are not limited to, melanoma, lung cancer, brain cancer, breast cancer, colorectal cancer, pancreatic cancer, liver cancer, prostate cancer, skin cancer, kidney cancer, or bladder cancer. The subject to be treated by the methods described herein may be a mammal (e.g., may be a human). Mammals include but are not limited to: farm animals (e.g., livestock), sport animals, laboratory animals, pets, primates, horses, dogs, cats, mice, and rats.

A subject having a cancer may be identified by routine medical examination, e.g., laboratory tests, biopsy, PET scans, CT scans, or ultrasounds. A subject suspected of having a cancer might show one or more symptoms of the disorder, e.g., unexplained weight loss, fever, fatigue, cough, pain, skin changes, unusual bleeding or discharge, and/or thickening or lumps in parts of the body. A subject at risk for a cancer may be a subject having one or more of the risk factors for that disorder. For example, risk factors associated with cancer include, but are not limited to, (a) viral infection (e.g., herpes virus infection), (b) age, (c) family history, (d) heavy alcohol consumption, (e) obesity, (f) genetics, (g) chemical or toxin exposure, and (h) tobacco use.

“An effective amount” as used herein refers to the amount of each active agent required to confer therapeutic effect on the subject, either alone or in combination with one or more other active agents. Effective amounts vary, as recognized by those skilled in the art, depending on the particular condition being treated, the severity of the condition, the individual patient parameters including age, physical condition, size, gender and weight, the duration of the treatment, the nature of concurrent therapy (if any), the specific route of administration and like factors within the knowledge and expertise of the health practitioner. These factors are well known to those of ordinary skill in the art and can be addressed with no more than routine experimentation. It is generally preferred that a maximum dose of the individual components or combinations thereof be used, that is, the highest safe dose according to sound medical judgment. It will be understood by those of ordinary skill in the art, however, that a patient may insist upon a lower dose or tolerable dose for medical reasons, psychological reasons, or for virtually any other reasons.

Empirical considerations, such as the half-life of a therapeutic compound, generally contribute to the determination of the dosage. For example, antibodies that are compatible with the human immune system, such as humanized antibodies or fully human antibodies, may be used to prolong half-life of the antibody and to prevent the antibody being attacked by the host's immune system. Frequency of administration may be determined and adjusted over the course of therapy, and is generally (but not necessarily) based on treatment, and/or suppression, and/or amelioration, and/or delay of a cancer. Alternatively, sustained continuous release formulations of an anti-cancer therapeutic agent may be appropriate. Various formulations and devices for achieving sustained release are known in the art.

In some embodiments, dosages for an anti-cancer therapeutic agent as described herein may be determined empirically in individuals who have been administered one or more doses of the anti-cancer therapeutic agent. Individuals may be administered incremental dosages of the anti-cancer therapeutic agent. To assess efficacy of an administered anti-cancer therapeutic agent, one or more aspects of a cancer (e.g., tumor formation, tumor growth, tumor type, MF expression signature) may be analyzed.

Generally, for administration of any of the anti-cancer antibodies described herein, an initial candidate dosage may be about 2 mg/kg. For the purpose of the present disclosure, a typical daily dosage might range from about any of 0.1 μg/kg to 3 μg/kg to 30 μg/kg to 300 μg/kg to 3 mg/kg, to 30 mg/kg to 100 mg/kg or more, depending on the factors mentioned above. For repeated administrations over several days or longer, depending on the condition, the treatment is sustained until a desired suppression or amelioration of symptoms occurs or until sufficient therapeutic levels are achieved to alleviate a cancer, or one or more symptoms thereof. An exemplary dosing regimen comprises administering an initial dose of about 2 mg/kg, followed by a weekly maintenance dose of about 1 mg/kg of the antibody, or followed by a maintenance dose of about 1 mg/kg every other week. However, other dosage regimens may be useful, depending on the pattern of pharmacokinetic decay that the practitioner (e.g., a medical doctor) wishes to achieve. For example, dosing from one to four times a week is contemplated. In some embodiments, dosing ranging from about 3 μg/kg to about 2 mg/kg (such as about 3 μg/kg, about 10 μg/kg, about 30 μg/kg, about 100 μg/kg, about 300 μg/kg, about 1 mg/kg, and about 2 mg/kg) may be used. In some embodiments, dosing frequency is once every week, every 2 weeks, every 4 weeks, every 5 weeks, every 6 weeks, every 7 weeks, every 8 weeks, every 9 weeks, or every 10 weeks; or once every month, every 2 months, or every 3 months, or longer. The progress of this therapy may be monitored by conventional techniques and assays and/or by monitoring cancer Types A-D as described herein. The dosing regimen (including the therapeutic used) may vary over time.
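
The weight-based arithmetic behind the exemplary regimen above (an initial dose of about 2 mg/kg followed by maintenance doses of about 1 mg/kg weekly or every other week) can be sketched as follows. This Python example only illustrates the unit conversion from mg/kg to mg; the function names, the default parameters, and the 70 kg body weight used in the note below are assumptions made for illustration and are not a dosing recommendation.

```python
from typing import List, Tuple


def dose_mg(weight_kg: float, dose_mg_per_kg: float) -> float:
    """Convert a weight-based dose (mg/kg) into an absolute dose (mg)."""
    if weight_kg <= 0 or dose_mg_per_kg < 0:
        raise ValueError("weight must be positive and dose non-negative")
    return weight_kg * dose_mg_per_kg


def exemplary_schedule(
    weight_kg: float,
    n_maintenance: int,
    initial_mg_per_kg: float = 2.0,      # "about 2 mg/kg" initial dose
    maintenance_mg_per_kg: float = 1.0,  # "about 1 mg/kg" maintenance dose
    interval_weeks: int = 1,             # 1 = weekly, 2 = every other week
) -> List[Tuple[int, float]]:
    """Return (week, dose in mg) pairs for the initial and maintenance doses."""
    schedule = [(0, dose_mg(weight_kg, initial_mg_per_kg))]
    for i in range(1, n_maintenance + 1):
        week = i * interval_weeks
        schedule.append((week, dose_mg(weight_kg, maintenance_mg_per_kg)))
    return schedule
```

Under these assumptions, a hypothetical 70 kg subject would receive 140 mg at week 0 and 70 mg at each maintenance visit; setting `interval_weeks=2` gives the every-other-week variant.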

When the anti-cancer therapeutic agent is not an antibody, it may be administered at the rate of about 0.1 to 300 mg/kg of the weight of the patient divided into one to three doses, or as described herein. In some embodiments, for an adult patient of normal weight, doses ranging from about 0.3 to 5.00 mg/kg may be administered. The particular dosage regimen, e.g., dose, timing, and/or repetition, will depend on the particular subject and that individual's medical history, as well as the properties of the individual agents (such as the half-life of the agent, and other considerations well known in the art).

For the purpose of the present disclosure, the appropriate dosage of an anti-cancer therapeutic agent will depend on the specific anti-cancer therapeutic agent(s) (or compositions thereof) employed, the type and severity of cancer, whether the anti-cancer therapeutic agent is administered for preventive or therapeutic purposes, previous therapy, the patient's clinical history and response to the anti-cancer therapeutic agent, and the discretion of the attending physician. Typically the clinician will administer an anti-cancer therapeutic agent, such as an antibody, until a dosage is reached that achieves the desired result.

Administration of an anti-cancer therapeutic agent can be continuous or intermittent, depending, for example, upon the recipient's physiological condition, whether the purpose of the administration is therapeutic or prophylactic, and other factors known to skilled practitioners. The administration of an anti-cancer therapeutic agent (e.g., an anti-cancer antibody) may be essentially continuous over a preselected period of time or may be in a series of spaced doses, e.g., either before, during, or after developing cancer.

As used herein, the term “treating” refers to the application or administration of a composition including one or more active agents to a subject, who has a cancer, a symptom of a cancer, or a predisposition toward a cancer, with the purpose to cure, heal, alleviate, relieve, alter, remedy, ameliorate, improve, or affect the cancer or one or more symptoms of the cancer, or the predisposition toward a cancer.

Alleviating a cancer includes delaying the development or progression of the disease, or reducing disease severity. Alleviating the disease does not necessarily require curative results. As used herein, “delaying” the development of a disease (e.g., a cancer) means to defer, hinder, slow, retard, stabilize, and/or postpone progression of the disease. This delay can be of varying lengths of time, depending on the history of the disease and/or individuals being treated. A method that “delays” or alleviates the development of a disease, or delays the onset of the disease, is a method that reduces probability of developing one or more symptoms of the disease in a given time frame and/or reduces extent of the symptoms in a given time frame, when compared to not using the method. Such comparisons are typically based on clinical studies, using a number of subjects sufficient to give a statistically significant result.

“Development” or “progression” of a disease means initial manifestations and/or ensuing progression of the disease. Development of the disease can be detected and assessed using clinical techniques known in the art. Alternatively or in addition to the clinical techniques known in the art, development of the disease may be detectable and assessed based on the cancer types described herein. However, development also refers to progression that may be undetectable. For purpose of this disclosure, development or progression refers to the biological course of the symptoms. “Development” includes occurrence, recurrence, and onset. As used herein “onset” or “occurrence” of a cancer includes initial onset and/or recurrence.

In some embodiments, the anti-cancer therapeutic agent (e.g., an antibody) described herein is administered to a subject in need of the treatment at an amount sufficient to reduce cancer (e.g., tumor) growth by at least 10% (e.g., 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% or greater). In some embodiments, the anti-cancer therapeutic agent (e.g., an antibody) described herein is administered to a subject in need of the treatment at an amount sufficient to reduce cancer cell number or tumor size by at least 10% (e.g., 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% or more). In other embodiments, the anti-cancer therapeutic agent is administered in an amount effective in altering cancer type. Alternatively, the anti-cancer therapeutic agent is administered in an amount effective in reducing tumor formation or metastasis.

Conventional methods, known to those of ordinary skill in the art of medicine, may be used to administer the anti-cancer therapeutic agent to the subject, depending upon the type of disease to be treated or the site of the disease. The anti-cancer therapeutic agent can also be administered via other conventional routes, e.g., administered orally, parenterally, by inhalation spray, topically, rectally, nasally, buccally, vaginally or via an implanted reservoir. The term “parenteral” as used herein includes subcutaneous, intracutaneous, intravenous, intramuscular, intraarticular, intraarterial, intrasynovial, intrasternal, intrathecal, intralesional, and intracranial injection or infusion techniques. In addition, an anti-cancer therapeutic agent may be administered to the subject via injectable depot routes of administration such as using 1-, 3-, or 6-month depot injectable or biodegradable materials and methods.

Injectable compositions may contain various carriers such as vegetable oils, dimethylacetamide, dimethylformamide, ethyl lactate, ethyl carbonate, isopropyl myristate, ethanol, and polyols (e.g., glycerol, propylene glycol, liquid polyethylene glycol, and the like). For intravenous injection, water soluble anti-cancer therapeutic agents can be administered by the drip method, whereby a pharmaceutical formulation containing the antibody and a physiologically acceptable excipient is infused. Physiologically acceptable excipients may include, for example, 5% dextrose, 0.9% saline, Ringer's solution, and/or other suitable excipients. Intramuscular preparations, e.g., a sterile formulation of a suitable soluble salt form of the anti-cancer therapeutic agent, can be dissolved and administered in a pharmaceutical excipient such as Water-for-Injection, 0.9% saline, and/or 5% glucose solution.

In one embodiment, an anti-cancer therapeutic agent is administered via site-specific or targeted local delivery techniques. Examples of site-specific or targeted local delivery techniques include various implantable depot sources of the agent or local delivery catheters, such as infusion catheters, an indwelling catheter, or a needle catheter, synthetic grafts, adventitial wraps, shunts and stents or other implantable devices, site specific carriers, direct injection, or direct application. See, e.g., PCT Publication No. WO 00/53211 and U.S. Pat. No. 5,981,568, the contents of each of which are incorporated by reference herein for this purpose.

Targeted delivery of therapeutic compositions containing an antisense polynucleotide, expression vector, or subgenomic polynucleotides can also be used. Receptor-mediated DNA delivery techniques are described in, for example, Findeis et al., Trends Biotechnol. (1993) 11:202; Chiou et al., Gene Therapeutics: Methods And Applications Of Direct Gene Transfer (J. A. Wolff, ed.) (1994); Wu et al., J. Biol. Chem. (1988) 263:621; Wu et al., J. Biol. Chem. (1994) 269:542; Zenke et al., Proc. Natl. Acad. Sci. USA (1990) 87:3655; Wu et al., J. Biol. Chem. (1991) 266:338. The contents of each of the foregoing are incorporated by reference herein for this purpose.

Therapeutic compositions containing a polynucleotide may be administered in a range of about 100 ng to about 200 mg of DNA for local administration in a gene therapy protocol. In some embodiments, concentration ranges of about 500 ng to about 50 mg, about 1 μg to about 2 mg, about 5 μg to about 500 μg, and about 20 μg to about 100 μg of DNA or more can also be used during a gene therapy protocol.

Therapeutic polynucleotides and polypeptides can be delivered using gene delivery vehicles. The gene delivery vehicle can be of viral or non-viral origin (e.g., Jolly, Cancer Gene Therapy (1994) 1:51; Kimura, Human Gene Therapy (1994) 5:845; Connelly, Human Gene Therapy (1995) 1:185; and Kaplitt, Nature Genetics (1994) 6:148). The contents of each of the foregoing are incorporated by reference herein for this purpose. Expression of such coding sequences can be induced using endogenous mammalian or heterologous promoters and/or enhancers. Expression of the coding sequence can be either constitutive or regulated.

Viral-based vectors for delivery of a desired polynucleotide and expression in a desired cell are well known in the art. Exemplary viral-based vehicles include, but are not limited to, recombinant retroviruses (see, e.g., PCT Publication Nos. WO 90/07936; WO 94/03622; WO 93/25698; WO 93/25234; WO 93/11230; WO 93/10218; WO 91/02805; U.S. Pat. Nos. 5,219,740 and 4,777,127; GB Patent No. 2,200,651; and EP Patent No. 0 345 242), alphavirus-based vectors (e.g., Sindbis virus vectors, Semliki forest virus (ATCC VR-67; ATCC VR-1247), Ross River virus (ATCC VR-373; ATCC VR-1246) and Venezuelan equine encephalitis virus (ATCC VR-923; ATCC VR-1250; ATCC VR-1249; ATCC VR-532)), and adeno-associated virus (AAV) vectors (see, e.g., PCT Publication Nos. WO 94/12649; WO 93/03769; WO 93/19191; WO 94/28938; WO 95/11984 and WO 95/00655). Administration of DNA linked to killed adenovirus as described in Curiel, Hum. Gene Ther. (1992) 3:147 can also be employed. The contents of each of the foregoing are incorporated by reference herein for this purpose.

Non-viral delivery vehicles and methods can also be employed, including, but not limited to, polycationic condensed DNA linked or unlinked to killed adenovirus alone (see, e.g., Curiel, Hum. Gene Ther. (1992) 3:147); ligand-linked DNA (see, e.g., Wu, J. Biol. Chem. (1989) 264:16985); eukaryotic cell delivery vehicle cells (see, e.g., U.S. Pat. No. 5,814,482; PCT Publication Nos. WO 95/07994; WO 96/17072; WO 95/30763; and WO 97/42338) and nucleic charge neutralization or fusion with cell membranes. Naked DNA can also be employed. Exemplary naked DNA introduction methods are described in PCT Publication No. WO 90/11092 and U.S. Pat. No. 5,580,859. Liposomes that can act as gene delivery vehicles are described in U.S. Pat. No. 5,422,120; PCT Publication Nos. WO 95/13796; WO 94/23697; WO 91/14445; and EP Patent No. 0524968. Additional approaches are described in Philip, Mol. Cell. Biol. (1994) 14:2411, and in Woffendin, Proc. Natl. Acad. Sci. (1994) 91:1581. The contents of each of the foregoing are incorporated by reference herein for this purpose.

It is also apparent that an expression vector can be used to direct expression of any of the protein-based anti-cancer therapeutic agents (e.g., anti-cancer antibody). For example, peptide inhibitors that are capable of blocking (from partial to complete blocking) a cancer causing biological activity are known in the art.

In some embodiments, more than one anti-cancer therapeutic agent, such as an antibody and a small molecule inhibitory compound, may be administered to a subject in need of the treatment. The agents may be of the same type or different types from each other. At least one, at least two, at least three, at least four, or at least five different agents may be co-administered. Generally anti-cancer agents for administration have complementary activities that do not adversely affect each other. Anti-cancer therapeutic agents may also be used in conjunction with other agents that serve to enhance and/or complement the effectiveness of the agents. Treatment efficacy can be assessed by methods well-known in the art, e.g., monitoring tumor growth or formation in a patient subjected to the treatment. Alternatively or in addition, treatment efficacy can be assessed by monitoring tumor type over the course of treatment (e.g., before, during, and after treatment).

Combination Therapy

Compared to monotherapies, combinations of treatment approaches showed higher efficacy in many studies, but the choice of remedies to be combined and the design of the combination therapy regimen remain speculative. Given that the number of possible combinations is now extremely high, there is great need for a tool that would help to select drugs and combinations of remedies based on objective information about a particular patient. Use of patient specific information (e.g., a patient's sequencing data) for designing or electing a specific combination therapy establishes a scientific basis for choosing the optimal combination of preparations.

Also provided herein are methods of treating a cancer or recommending treating a cancer using any combination of anti-cancer therapeutic agents or one or more anti-cancer therapeutic agents and one or more additional therapies (e.g., surgery and/or radiotherapy). The term combination therapy, as used herein, embraces administration of more than one treatment (e.g., an antibody and a small molecule or an antibody and radiotherapy) in a sequential manner, that is, wherein each therapeutic agent is administered at a different time, as well as administration of these therapeutic agents, or at least two of the agents or therapies, in a substantially simultaneous manner.

Sequential or substantially simultaneous administration of each agent or therapy can be effected by any appropriate route including, but not limited to, oral routes, intravenous routes, intramuscular routes, subcutaneous routes, and direct absorption through mucous membrane tissues. The agents or therapies can be administered by the same route or by different routes. For example, a first agent (e.g., a small molecule) can be administered orally, and a second agent (e.g., an antibody) can be administered intravenously.

As used herein, the term “sequential” means, unless otherwise specified, characterized by a regular sequence or order, e.g., if a dosage regimen includes the administration of an antibody and a small molecule, a sequential dosage regimen could include administration of the antibody before, simultaneously, substantially simultaneously, or after administration of the small molecule, but both agents will be administered in a regular sequence or order. The term “separate” means, unless otherwise specified, to keep apart one from the other. The term “simultaneously” means, unless otherwise specified, happening or done at the same time, i.e., the agents of the invention are administered at the same time. The term “substantially simultaneously” means that the agents are administered within minutes of each other (e.g., within 10 minutes of each other) and intends to embrace joint administration as well as consecutive administration, but if the administration is consecutive it is separated in time for only a short period (e.g., the time it would take a medical practitioner to administer two agents separately). As used herein, concurrent administration and substantially simultaneous administration are used interchangeably. Sequential administration refers to temporally separated administration of the agents or therapies described herein.

Combination therapy can also embrace the administration of the anti-cancer therapeutic agent (e.g., an antibody) in further combination with other biologically active ingredients (e.g., a vitamin) and non-drug therapies (e.g., surgery or radiotherapy).

It should be appreciated that any combination of anti-cancer therapeutic agents may be used in any sequence for treating a cancer. The combinations described herein may be selected on the basis of a number of factors, which include but are not limited to the effectiveness of altering identified tumor type, reducing tumor formation or tumor growth, and/or alleviating at least one symptom associated with the cancer, or the effectiveness for mitigating the side effects of another agent of the combination. For example, a combined therapy as provided herein may reduce any of the side effects associated with each individual member of the combination, for example, a side effect associated with an administered anti-cancer agent.

In some embodiments, an anti-cancer therapeutic agent is an antibody, an immunotherapy, a radiation therapy, a surgical therapy, and/or a chemotherapy.

Examples of the antibody anti-cancer agents include, but are not limited to, alemtuzumab (Campath), trastuzumab (Herceptin), Ibritumomab tiuxetan (Zevalin), Brentuximab vedotin (Adcetris), Ado-trastuzumab emtansine (Kadcyla), blinatumomab (Blincyto), Bevacizumab (Avastin), Cetuximab (Erbitux), ipilimumab (Yervoy), nivolumab (Opdivo), pembrolizumab (Keytruda), atezolizumab (Tecentriq), avelumab (Bavencio), durvalumab (Imfinzi), and panitumumab (Vectibix).

Examples of an immunotherapy include, but are not limited to, a PD-1 inhibitor or a PD-L1 inhibitor, a CTLA-4 inhibitor, adoptive cell transfer therapy, therapeutic cancer vaccines, oncolytic virus therapy, T-cell therapy, and immune checkpoint inhibitors. In some embodiments, an immunotherapy may include a chimeric antigen receptor (CAR) T-cell therapy. A CAR is designed for a T-cell and is a chimera of a signaling domain of the T-cell receptor (TcR) complex and an antigen-recognizing domain (e.g., a single chain fragment (scFv) of an antibody) (Enblad et al., Human Gene Therapy. 2015; 26(8):498-505). In some embodiments, an antigen binding receptor is a chimeric antigen receptor (CAR). A T cell that expresses a CAR is referred to as a “CAR T cell.” A CAR T cell receptor, in some embodiments, comprises a signaling domain of the T-cell receptor (TcR) complex and an antigen-recognizing domain (e.g., a single chain fragment (scFv) of an antibody) (Enblad et al., Human Gene Therapy. 2015; 26(8):498-505).

Examples of radiation therapy include, but are not limited to, ionizing radiation, gamma-radiation, neutron beam radiotherapy, electron beam radiotherapy, proton therapy, brachytherapy, systemic radioactive isotopes, and radiosensitizers.

Examples of a surgical therapy include, but are not limited to, a curative surgery (e.g., tumor removal surgery), a preventive surgery, a laparoscopic surgery, and a laser surgery.

Examples of the chemotherapeutic agents include, but are not limited to, Carboplatin or Cisplatin, Docetaxel, Gemcitabine, Nab-Paclitaxel, Paclitaxel, Pemetrexed, and Vinorelbine.

Additional examples of chemotherapy include, but are not limited to, Platinating agents, such as Carboplatin, Oxaliplatin, Cisplatin, Nedaplatin, Satraplatin, Lobaplatin, Triplatin tetranitrate, Picoplatin, Prolindac, Aroplatin and other derivatives; Topoisomerase I inhibitors, such as Camptothecin, Topotecan, irinotecan/SN38, rubitecan, Belotecan, and other derivatives; Topoisomerase II inhibitors, such as Etoposide (VP-16), Daunorubicin, a doxorubicin agent (e.g., doxorubicin, doxorubicin hydrochloride, doxorubicin analogs, or doxorubicin and salts or analogs thereof in liposomes), Mitoxantrone, Aclarubicin, Epirubicin, Idarubicin, Amrubicin, Amsacrine, Pirarubicin, Valrubicin, Zorubicin, Teniposide and other derivatives; Antimetabolites, such as Folic family (Methotrexate, Pemetrexed, Raltitrexed, Aminopterin, and relatives or derivatives thereof); Purine antagonists (Thioguanine, Fludarabine, Cladribine, 6-Mercaptopurine, Pentostatin, clofarabine, and relatives or derivatives thereof) and Pyrimidine antagonists (Cytarabine, Floxuridine, Azacitidine, Tegafur, Carmofur, Capecitabine, Gemcitabine, hydroxyurea, 5-Fluorouracil (5FU), and relatives or derivatives thereof); Alkylating agents, such as Nitrogen mustards (e.g., Cyclophosphamide, Melphalan, Chlorambucil, mechlorethamine, Ifosfamide, Trofosfamide, Prednimustine, Bendamustine, Uramustine, Estramustine, and relatives or derivatives thereof); nitrosoureas (e.g., Carmustine, Lomustine, Semustine, Fotemustine, Nimustine, Ranimustine, Streptozocin, and relatives or derivatives thereof); Triazenes (e.g., Dacarbazine, Altretamine, Temozolomide, and relatives or derivatives thereof); Alkyl sulphonates (e.g., Busulfan, Mannosulfan, Treosulfan, and relatives or derivatives thereof); Procarbazine; Mitobronitol, and Aziridines (e.g., Carboquone, Triaziquone, ThioTEPA, triethylenemelamine, and relatives or derivatives thereof); Antibiotics, such as Hydroxyurea, Anthracyclines (e.g., doxorubicin agent, daunorubicin, epirubicin and relatives or derivatives thereof); Anthracenediones (e.g., Mitoxantrone and relatives or derivatives thereof); Streptomyces family antibiotics (e.g., Bleomycin, Mitomycin C, Actinomycin, and Plicamycin); and ultraviolet light.

EXAMPLES

In order that the invention described herein may be more fully understood, the following examples are set forth. The examples described in this application are offered to illustrate the methods, compositions, and systems provided herein and are not to be construed in any way as limiting their scope.

Example 1: Workflow for WES and RNA Sequencing

Provided below is an example of specimen collection from a subject having or suspected of having cancer, DNA and/or RNA extraction therefrom, DNA library preparation (cDNA in the case of library preparation from RNA), and data processing.

Specimen Collection

Prior to collection of biological samples from a subject having or suspected of having a cancer, sufficient quantities of sterilized instruments, consumables, and reagents (e.g., digest buffer) were verified.

For tumor tissue (bulk), 30 mg of tumor tissue was collected from a subject and put into a 2 ml cryogenic tube with RNA-later, the contents of which were then snap frozen. The specimens were shipped on dry ice as needed.

For blood samples (which are considered “normal tissue” (or non-cancerous)), 0.5-1 mL of whole blood was collected in an EDTA vacutainer collection tube (plastic preferred) labeled with at least the sample ID and date/time collected. The vacutainer tube was then placed into a sealed biohazard bag with absorbent materials. Whole blood in EDTA was frozen on dry ice as needed and sent to a laboratory with other specimen(s) as needed. FIG. 1B illustrates an embodiment of a process that includes the sample collection process.

Creation of Single Cell Suspension for CYTOF and RNA-Seq (SCS, Optional, Validation)

The following steps were used to create single-cell suspensions (SCS) from tumor samples that were collected in 50 mL of cold L-15 medium (1×).

1) Transfer the container with the tumor sample from an operating room on ice to a biological safety hood for dissection, wherein it takes approximately 60-90 min from surgical resection to the bench.
2) Transfer the tumor sample into a 100×15 mm petri dish containing fresh L-15 medium. Using a curved scissor, dissect the tumor into fragments of 1-2 mm³ on a sterile petri dish with L-15 to keep the tissue moist. To a 50 mL conical tube containing 25 mL of enzyme cocktail, add 0.5 g of tumor tissue.
3) Place the tube on a shaker at a speed of 85 rpm for 45 min at 37° C.
4) After 45 min, vigorously pipette the contents using a 10 mL pipette. Incubate for another 45 min under the same conditions.
5) After the incubation, filter the sample through a 70 μm cell strainer into a new 50 mL conical tube. Using the back of a 3 mL syringe, gently apply pressure on the cell strainer to disaggregate any remaining tissue.
6) Add 25 mL of warm (37° C.) L-15 media containing 10% FBS through the cell strainer, into the 50 mL conical tube.
7) Centrifuge at 300 g at room temperature for 5 min. Decant the supernatant.
8) Add 10 mL of warm (37° C.) 1× eBioscience multi species RBC lysis buffer. Incubate in the dark for 5 min at room temperature.
9) After the incubation, add 40 mL of cold 1×PBS to the tube. Centrifuge at 300 g for 5 min at 4° C. Decant the supernatant. Add 10 mL of cold DMEM with 10% FBS, and resuspend the pellet gently.
10) Centrifuge at 300 g for 5 min at 4° C. Decant the supernatant. Resuspend cells in 1 mL of cold L-15 with 10% FBS.
11) Filter the sample through a 70 μm cell strainer into a new 50 mL conical tube.
12) Count cells using Trypan blue. Also assess viability using MoxiFlow (4 μL cells+1960 MoxiFlow Viability Reagent, use 75 μL to test).

2,000,000 cells were aliquoted into a 15 mL conical tube. Once the cells had been pelleted (more than 2×10⁶), each lysate was resuspended in 500-750 μl of RNAlater/RNA Protect in a 1.5 ml microcentrifuge tube. The 1.5 ml tube was placed into a 50 mL conical tube with tissue paper/paper towels on top to secure the 1.5 mL tubes. The 50 ml tube can then be shipped with the tumor specimen(s) on dry ice.

DNA and RNA Extraction from Bulk Biopsy

Extraction of normal DNA and RNA. DNA from biopsy specimens was extracted using DSP DNA Midi Kit (Qiagen®) using an automated process on the QIAsymphony (www.qiagen.com/us/shop/automated-solutions/sample-preparation/qiasymphony-spas-instruments/).

A minimum of 1000-2000 ng of total DNA mass in at least 100 volumes (e.g., 100-200 ng/μl in 10 μl minimum) for each DNA sample was collected. Moreover, the extracted DNA solutions had 260/280 ratios of ~1.8.

A minimum of 1000-6000 ng of total RNA mass was collected. RNA Integrity Number (RIN) scores obtained via Agilent's BioAnalyzer or Tape Station were at least 7.

Extraction of tumor DNA and tumor RNA. DNA and RNA were extracted from 30 mg of tissue using AllPrep DNA/RNA Mini Kit by Qiagen® (using the manual process described by the manufacturer).

DNA/RNA Extraction and CYTOF for SCSs

Extraction: RNeasy Micro Plus Kit by Qiagen® (manual process) was used. A minimum of 2,000,000 cells were used for extraction. Table 1 below shows that the RNA concentration, yield, and quality drop substantially if RNA is extracted from a total of fewer than 2 million cells. It was found that 2 million cells provided at least 1.8 μg of RNA, which is sufficient for good quality RNAseq data (i.e., less noise and better correlation between RNA expression within different isoforms of the same protein coding RNA). It is recommended to have more than 1 μg of RNA for better quality.

TABLE 1
RNA quantity and quality as a function of the number of cells from which RNA is extracted.

Sample ID   Number of Cells   RNA Concentration (ng/uL)   Total Volume (uL)   Yield (ng)   RIN
BG002       2 million         70.2                        26                  1825         8.5
BG020       1 million         8.0                         57.5                460          8.3
BG005       0.5 million       8.2                         26                  213          8.7
BG008       0.5 million       8.0                         26                  208          8.4

CyTOF: Resuspend cells that are not going to be used for RNAseq (minimum 5 million) in cold cell staining buffer (CSB) and place on ice in preparation for antibody labeling.
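The yield column of Table 1 is the product of concentration and volume, and the text recommends more than 1 μg of RNA. The following Python sketch reproduces that arithmetic under those assumptions; the function names and the `MIN_RNA_NG` constant are hypothetical and are not part of the described workflow.

```python
# Illustrative sketch only: concentration and volume values are copied from Table 1.
# Function names and MIN_RNA_NG are hypothetical.
MIN_RNA_NG = 1000.0  # "more than 1 ug of RNA" recommendation from the text

TABLE_1 = [
    # (sample ID, RNA concentration in ng/uL, volume in uL)
    ("BG002", 70.2, 26.0),
    ("BG020", 8.0, 57.5),
    ("BG005", 8.2, 26.0),
    ("BG008", 8.0, 26.0),
]

def total_yield_ng(conc_ng_per_ul, volume_ul):
    """Total RNA mass (ng) = concentration (ng/uL) x volume (uL)."""
    return conc_ng_per_ul * volume_ul

def meets_recommended_input(conc_ng_per_ul, volume_ul, min_ng=MIN_RNA_NG):
    """True when the extracted RNA mass exceeds the recommended minimum."""
    return total_yield_ng(conc_ng_per_ul, volume_ul) > min_ng

for sample_id, conc, vol in TABLE_1:
    print(sample_id, round(total_yield_ng(conc, vol)), meets_recommended_input(conc, vol))
```

With the Table 1 values, only the 2 million cell sample (BG002) exceeds the recommended input.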

Library Preparation, RNA Sequencing, and WES

Illumina libraries were made and subjected to quality control (e.g., using Tapestation D1000 High Sensitivity DNA screen tape) to evaluate their integrity and peak size. The analysis consumed up to 1 ng library in 2 μL.

Whole Exome Sequencing (WES) on DNA samples (tumor tissue and germline blood) was performed using Agilent Human All Exon V6 Capture (48.2 Mb) or Clinical Research Exome (54.6 Mb). WES Illumina deep sequencing was performed with standard NextSeq RNA-seq configuration, Paired-End 100 bp Reads with an estimated coverage >100×.

RNA sequencing on RNA samples (tumor tissue and SCS) was performed using Illumina TruSeq RNA Library Prep PCR enrichment of captured DNA (Poly-A mRNA-seq), non-stranded (to compare data with that of The Cancer Genome Atlas (TCGA)) paired-end 100 bp Reads (75+75) with an estimated coverage >50 million paired-end reads.

PolyA Enrichment

Different RNA enrichment methods enrich RNA transcripts to different degrees. Ribosomal RNA (rRNA) depletion retains 10-50% of non-coding transcripts (e.g., rRNA, miRNA, long non-coding RNA (LncRNA)) in the library, so the percentage of protein-coding reads varies strongly depending on the method of RNA enrichment. In clinical settings the focus was on expression of protein coding transcripts. PolyA enrichment, compared to rRNA depletion, provided a more stable and controllable percentage of protein coding transcripts (FIG. 2).

Further, because PolyA enrichment was used, and it was known that protein-coding RNA was enriched, RNA sequencing was performed on non-stranded RNA. FIG. 2B demonstrates the differences in RNA expression levels of IL24, ICAM4, and GAPDH RNA seen when either stranded or non-stranded RNA is used for sequencing.

FASTQ Files Processing, and RNA Expression Assessment

The raw data in the NextSeq BCL file format was converted to the standard Illumina FASTQ format. As described herein, any type of format that is suitable for further analysis can be used. In this example, the FASTQ data was subjected to quality control using standard quality control algorithms (e.g., FastQC (www.bioinformatics.babraham.ac.uk/projects/fastqc/) and RSeQC (rseqc.sourceforge.net/)), and then processed to obtain expression per gene in TPM with no or minimal batch effects across samples. Data in the form of FASTQ files was delivered via a secure SFTP server or Illumina BaseSpace.

Quality Control Steps Assuring Quality of FASTQ Files

The following are steps involved in assuring quality control of the data in FASTQ files:

(1) Remove low-quality reads. This can be performed by using any suitable software or tool to evaluate and/or remove reads that are deemed of low quality, such as based on positional information. In some embodiments, low-quality reads can be removed by using FILTERBYTILE (e.g., www.filterbytile.sh (from BBmap)). In some embodiments, low-quality reads (e.g., bad tiles) are removed from sequence files (e.g., FASTQ files). In some embodiments, the data analysis pipeline may be stopped if the quality of the reads is too low for further analysis with sufficient confidence. For example, in some embodiments, if bad tiles represent greater than a threshold percentage (e.g., 50%) of the sample, the analysis pipeline is terminated.

(2) Assure quality control based on various parameters. This can be performed by using any suitable software or tool to evaluate the confidence of the quality control. In some embodiments, quality control can be assured by using FastQC (e.g., www.bioinformatics.babraham.ac.uk/projects/fastqc/). In some embodiments, quality control can be assured by reviewing read counts as a measure of the complexity of the library. In some embodiments, quality control can be assured by reviewing per base Phred quality score as a measure of sequencing quality of the platform. In some embodiments, quality control can be assured by reviewing per tile quality score. In some embodiments, quality control can be assured by reviewing per sequence GC content to identify contamination. In some embodiments, quality control can be assured by reviewing per base sequencing content to identify adapter and other contamination. In some embodiments, quality control can be assured by reviewing sequence duplication levels as a measure of a quality of RNA/DNA selection and PCR. In some embodiments, quality control can be assured by reviewing adapter content.

In some embodiments, the data analysis pipeline may be stopped if the quality control cannot be assured for further analysis with sufficient confidence. For example, in some embodiments, if read counts represent greater than a threshold value (e.g., >20 min) or Phred scores represent greater than a threshold percentage (e.g., >50% green zone), the analysis pipeline is terminated.

(3) Determine cross-species contamination. This can be performed by using any suitable software or tool to evaluate the cross-species contamination. In some embodiments, cross-species contamination can be determined by using Fastq Screen (e.g., www.bioinformatics.babraham.ac.uk/projects/fastq_screen/_build/html/index.html). In some embodiments, cross-species contamination can comprise contamination from various species such as mouse, zebrafish, Drosophila, C. elegans, Saccharomyces, Arabidopsis, microbiome, adapters, vectors, and phiX. In some embodiments, the data analysis pipeline may be stopped if the cross-species contamination is too severe for further analysis with sufficient confidence. For example, in some embodiments, if contamination represents greater than a threshold percentage (e.g., >20%), the analysis pipeline is terminated.

(4) Assure quality of the data based on various parameters. This can be performed by using any suitable software or tool to evaluate the quality. In some embodiments, quality control can be assured by using Mosdepth (e.g., github.com/brentp/mosdepth). In some embodiments, quality can be assured by determining per chromosome coverage distribution (as a sex prediction algorithm). In some embodiments, quality can be assured by determining the coverage distribution of specific regions (e.g., Collaborative Consensus Coding Sequence (CCDS), exons, etc.). In some embodiments, the data analysis pipeline may be stopped if the quality of the data is too low for further analysis with sufficient confidence. For example, in some embodiments, if the confirmation of coverage of clinically important genome regions fails, the analysis pipeline is terminated.

(5) Assure the presence and the quality of certain characteristics of the data. This can be performed by using any suitable software or tool. In some embodiments, the presence or quality of certain characteristics of the data can be assured by using Picard (broadinstitute.github.io/picard/). In some embodiments, certain characteristics can be the number or the percentage of duplicates. In some embodiments, certain characteristics can be mapped regions. In some embodiments, certain characteristics can be properly paired regions.

(6) Assure quality control based on various parameters. This can be performed by using any suitable software or tool to evaluate the confidence of the quality control. In some embodiments, quality control can be assured by using RSeQC (e.g., rseqc.sourceforge.net/). In some embodiments, quality control can be assured by reviewing strandedness analysis to confirm a stranded or non-stranded RNA-seq protocol. In some embodiments, quality control can be assured by reviewing gene body coverage to detect coverage bias due to the extraction protocol (polyA/total RNA-seq) and RIN. In some embodiments, quality control can be assured by reviewing read distribution of exons, introns, transcription end sites (TES), and transcription start sites (TSS). In some embodiments, the data analysis pipeline may be stopped if the quality control cannot be assured for further analysis with sufficient confidence. For example, in some embodiments, if duplicates represent greater than a threshold percentage (e.g., <60%) for RNA or less than a threshold percentage (e.g., <20%) for adapter contamination, the analysis pipeline is terminated.

(7) Check cross-individual contamination by determining concordance of a pair of samples (e.g., tumor/normal from the same patient). This can be performed by using any suitable software or tool. In some embodiments, cross-individual contamination can be determined by using Conpair (e.g., github.com/nygenome/Conpair). In some embodiments, the data analysis pipeline may be stopped if the cross-individual contamination is too severe for further analysis with sufficient confidence. For example, in some embodiments, if normal DNA does not match tumor DNA, the analysis pipeline is terminated. In some embodiments, if large cross-individual contamination is detected, the analysis pipeline is terminated.

(8) Run a tumor type classifier. This can be performed by using any suitable software or tools. In some embodiments, a gene expression-based classifier can be used. For example, a gene expression-based classifier trained on RNAseq of previously sequenced tumors of different tissue types can be used to classify tumor type. Examples of such classifiers are described herein and in U.S. Provisional Patent Application Ser. No. 62/943,976, titled “Machine Learning Techniques for Gene Expression Analysis,” filed on Dec. 5, 2019, which is incorporated by reference herein in its entirety. In some embodiments, this allows the prediction of the tumor type from RNA-seq data on the basis of the gene expression data. In some embodiments, the data analysis pipeline may be stopped if the tumor type is a mismatch for further analysis with sufficient confidence. For example, in some embodiments, if the asserted tumor type from clinicians does not match the determined tumor type, the analysis pipeline is terminated.

(9) Predict library type. This can be performed by using any suitable software or tools. In some embodiments, an RNA-seq type classifier can be used. In some embodiments, the RNA-seq type classifier can be a gene expression-based classifier built on an XGboost (e.g., xgboost.readthedocs.io/en/latest/) trained model. In some embodiments, the prediction of library type is based on expression of specific genes from the RNA-seq data. In some embodiments, the data analysis pipeline may be stopped if the library type is a mismatch for further analysis with sufficient confidence. For example, in some embodiments, if the asserted library type does not match the determined library type (e.g., total RNA-seq, or polyA-RNA-seq), the analysis pipeline is terminated.

(10) Check concordance of HLA alleles. This can be performed by using any suitable software or tools. In some embodiments, MHC allele composition can be determined. In some embodiments, the data analysis pipeline may be stopped if the HLA allele is a mismatch for further analysis with sufficient confidence. For example, in some embodiments, if the HLA allele from a sample does not confirm the source of the samples, the analysis pipeline is terminated.

(11) Perform distribution analysis of expression for different transcript types. This can be performed by using any suitable software or tools. In some embodiments, the transcript type can be Mt rRNA, Mt tRNA, lincRNA, miRNA, misc RNA, protein coding, rRNA, snRNA, snoRNA, ribozyme, Ig, processed, NMD, or retained intron. In some embodiments, one or more transcript types can be determined. In some embodiments, the data analysis pipeline may be stopped if the transcript type is not suitable for further analysis with sufficient confidence. For example, in some embodiments, if the transcripts represent a greater threshold percentage (e.g., >70% transcripts are protein-coding transcripts), the analysis pipeline is terminated.
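The quality control steps share one pattern: compute a metric, compare it with a threshold or an asserted value, and stop the pipeline on failure. The Python sketch below illustrates that pattern for four of the checks, using the example thresholds given for bad tiles (50%) and cross-species contamination (20%). The `QcMetrics` class, its field names, and the failure labels are hypothetical; the metrics themselves would come from tools such as those named in the steps, which are not invoked here.

```python
from dataclasses import dataclass

@dataclass
class QcMetrics:
    """Hypothetical container for metrics produced upstream by QC tools."""
    bad_tile_fraction: float        # step (1): fraction of the sample on bad tiles
    cross_species_fraction: float   # step (3): fraction of contaminating reads
    tumor_normal_concordant: bool   # step (7): tumor/normal pair matches
    asserted_library_type: str      # step (9): e.g. "polyA" or "total"
    predicted_library_type: str     # step (9): classifier output

def qc_failures(m, max_bad_tiles=0.50, max_cross_species=0.20):
    """Return the names of failed checks; an empty list means the pipeline may continue."""
    failures = []
    if m.bad_tile_fraction > max_bad_tiles:
        failures.append("bad_tiles")
    if m.cross_species_fraction > max_cross_species:
        failures.append("cross_species_contamination")
    if not m.tumor_normal_concordant:
        failures.append("tumor_normal_mismatch")
    if m.asserted_library_type != m.predicted_library_type:
        failures.append("library_type_mismatch")
    return failures

good = QcMetrics(0.02, 0.01, True, "polyA", "polyA")
bad = QcMetrics(0.60, 0.25, True, "polyA", "total")
print(qc_failures(good))
print(qc_failures(bad))
```

Returning a list of failed checks rather than stopping at the first one is a design choice of this sketch, so that a report can name every reason a sample was rejected.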

Alignment

Alignment can be performed by using any suitable software or tools. For example, a program for quantifying transcripts, for example from bulk and single-cell RNA-Seq data, using high-throughput sequencing reads (e.g., Kallisto available from Github, www.github.com, for example as described in Nicolas L Bray, Harold Pimentel, Pall Melsted and Lior Pachter, Near-optimal probabilistic RNA-seq quantification, Nature Biotechnology 34, 525-527 (2016), doi:10.1038/nbt.3519) was run with input FASTQ files. Kallisto indexing was performed based on:

a. GRCh38 genome assembly (no alt analysis) with overlapping genes from the PAR locus removed.

b. Gene annotation based on GENCODE V23 comprehensive annotation (regions ALL) (www.gencodegenes.org)

Files with transcript expression in TPM (Transcripts Per Million) were obtained thereafter.

Removing of Non-Coding Transcripts for the Data and Other Biases

The Transcripts Per Million (TPM) expression allows presentation of gene expression in the format of concentration (per 1 million transcripts). This allows comparison of samples with different coverage and RNA sequencing depth.

TPM corrects the read counts by the length of each gene in bases. This correction can create a large bias in samples with an uneven distribution of non-coding transcripts, because some non-coding transcripts (miRNA, snRNA, snoRNA) have very small transcript lengths. FIG. 3 shows the biases that are created upon TPM calculation.
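The length correction can be illustrated with a small worked example. The Python sketch below uses the standard TPM definition (read count divided by transcript length, rescaled so that all values sum to one million). The transcript lengths, read counts, and the function name `tpm` are illustrative assumptions and are not data from this example. The two hypothetical samples have identical protein-coding read counts, but the second also contains reads from one short non-coding transcript.

```python
def tpm(counts, lengths_bp):
    """Transcripts Per Million: length-normalized read rates rescaled to sum to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

# Hypothetical transcripts: two protein-coding (2000 bp each) and one snoRNA (100 bp).
lengths_bp = [2000, 2000, 100]
sample_a = tpm([1000, 1000, 0], lengths_bp)    # no reads on the short transcript
sample_b = tpm([1000, 1000, 500], lengths_bp)  # 500 reads on the short transcript

# Same protein-coding read counts in both samples, very different protein-coding TPM.
print([round(x) for x in sample_a])
print([round(x) for x in sample_b])
```

In this toy case, 500 reads on a 100 bp transcript lower the TPM of each protein-coding transcript about six-fold (from 500,000 to roughly 83,333), even though the protein-coding read counts did not change.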

To remove batch effects based on uneven distribution of non-coding transcripts in the RNA library, non-coding transcripts were removed from the data before further RNA expression quantification.

Excluded types included:
{pseudogene, polymorphic_pseudogene, processed_pseudogene, transcribed_processed_pseudogene, unitary_pseudogene, unprocessed_pseudogene, transcribed_unitary_pseudogene, IG_C_pseudogene, IG_J_pseudogene, IG_V_pseudogene, transcribed_unprocessed_pseudogene, translated_unprocessed_pseudogene, TR_J_pseudogene, TR_V_pseudogene,
snRNA, snoRNA, miRNA, Ribozyme, rRNA, Mt_tRNA, Mt_rRNA, scaRNA,
retained_intron, sense_intronic, sense_overlapping, nonsense_mediated_decay, non_stop_decay,
Antisense, lincRNA, macro_lncRNA, processed_transcript, 3prime_overlapping_ncrna,
sRNA, misc_RNA, vaultRNA, TEC}

Retained types:
{protein_coding,
Ig (IG_C_gene, IG_D_gene, IG_J_gene, IG_V_gene),
TCR (TR_C_gene, TR_D_gene, TR_J_gene, TR_V_gene)}
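One way to apply the retained list is to keep only transcripts whose annotated type is in the retained set and to drop everything else. The following Python sketch assumes each transcript is represented as a dictionary with a `biotype` key taken from the gene annotation; that record layout, the function name, and the example records are hypothetical.

```python
# Retained transcript types as listed above; every other type is dropped.
RETAINED_BIOTYPES = {
    "protein_coding",
    "IG_C_gene", "IG_D_gene", "IG_J_gene", "IG_V_gene",  # Ig
    "TR_C_gene", "TR_D_gene", "TR_J_gene", "TR_V_gene",  # TCR
}

def keep_retained(transcripts):
    """Return only the records whose 'biotype' is in the retained set."""
    return [t for t in transcripts if t["biotype"] in RETAINED_BIOTYPES]

# Hypothetical records; identifiers and TPM values are made up for illustration.
example = [
    {"id": "tx1", "gene": "GAPDH", "biotype": "protein_coding", "tpm": 900.0},
    {"id": "tx2", "gene": "SNORD3A", "biotype": "snoRNA", "tpm": 5000.0},
    {"id": "tx3", "gene": "IGHG1", "biotype": "IG_C_gene", "tpm": 40.0},
    {"id": "tx4", "gene": "GAPDH", "biotype": "retained_intron", "tpm": 60.0},
]
kept = keep_retained(example)
print([t["id"] for t in kept])
```

Filtering on the short retained list rather than on the long excluded list means that a transcript type absent from both lists is dropped by default; that is a choice made for this sketch.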

In addition to removing non-coding transcripts, genes that were found to have the highest variance between PolyA RNA sequencing and total RNA sequencing were also removed. Such genes included (1) histone-encoding genes, and (2) mitochondria-related genes, having very long or very short PolyA tails, which result in uneven enrichment of the transcripts.

FIG. 4A shows the variation in the length of PolyA tails for different histone-encoding genes. FIG. 4B shows a comparison of expression of histone coding and mitochondrial genes within samples in which RNA was enriched by either polyA enrichment or by ribo-RNA depletion (total RNA). Genes that are excluded are described in the present disclosure (e.g., transcripts from protein non-coding regions, histone-encoding genes, and mitochondria-related genes).

Gene Aggregation and TPM Normalization

Expression per gene was calculated as a sum of the expression of the transcripts for the gene. Gene expression data was normalized to the total number of transcripts (per million). This procedure allows correction for major batch effects associated with library preparation, uneven RNA transcript distribution between samples, and correction for RNA enrichment method (FIG. 5).
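The aggregation and renormalization can be sketched in a few lines of Python. The sketch assumes the input holds only transcripts that remain after the exclusions described above, each with a gene name and a TPM value; the record layout, gene names, and function name are hypothetical.

```python
from collections import defaultdict

def gene_tpm(transcripts):
    """Sum transcript expression per gene, then rescale so that genes sum to 1e6."""
    per_gene = defaultdict(float)
    for t in transcripts:
        per_gene[t["gene"]] += t["tpm"]
    total = sum(per_gene.values())
    if total == 0:
        raise ValueError("no expression left after filtering")
    return {gene: value * 1e6 / total for gene, value in per_gene.items()}

# Hypothetical retained transcripts (values made up for illustration).
retained = [
    {"gene": "GENE_A", "tpm": 100.0},
    {"gene": "GENE_A", "tpm": 300.0},
    {"gene": "GENE_B", "tpm": 600.0},
]
result = gene_tpm(retained)
print(result)
```

Because the rescaling uses only the retained transcripts, the resulting gene-level values are expressed per million retained transcripts and no longer depend on how many excluded transcripts a library contained.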

Example 2: DNA and RNA Extraction from Peripheral Blood Mononuclear Cells (PBMC) or Cell Suspensions

To prepare nucleic acid materials for downstream sequencing analysis, DNA and/or RNA was extracted from a single PBMC cell pellet or suitable cell suspensions. In brief, AllPrep DNA/RNA assay kits (Qiagen®) were used for purifying genomic DNA and total RNA simultaneously from a single biological sample. Biological samples were first lysed and homogenized in a highly denaturing guanidine-isothiocyanate-containing buffer, which immediately inactivated DNases and RNases to ensure isolation of intact DNA and RNA. The lysate was then passed through an AllPrep DNA spin column. This column, in combination with the high-salt buffer, allowed selective and efficient binding of genomic DNA. The column was washed and the DNA was then eluted. Alternatively, the lysate that passed through the AllPrep DNA spin column went through an RNeasy spin column to selectively isolate RNA.

In some circumstances, for further improving the quality of the starting RNA, ethanol was added to the flow-through from the AllPrep DNA spin column to provide appropriate binding conditions for RNA. The sample was then applied to an RNeasy spin column, where total RNA was bound to the membrane and contaminants were washed away. High-quality RNA was then eluted in water. Some of the steps and/or the entire procedure can be managed and conducted by lab personnel. In the event that quality control related issues arise, lab personnel will notify the provider (e.g., healthcare provider) of the cells or tissue (e.g., PBMCs or cell suspensions).

Preparation of Reagents

Reagents for the extraction of DNA and/or RNA from a sample were prepared according to the manufacturer's instruction, including AllPrep DNA/RNA Mini handbook and AllPrep DNA/RNA Micro handbook, the contents of which are incorporated by reference herein. Some of the processes may be customized based on the requirements of the nucleic acids of a given sequencing platform. In general, β-mercaptoethanol (β-ME) was added to Buffer RLT Plus before use. 10 μL β-ME per 1 mL Buffer RLT Plus was added. The lab personnel who conducted the preparation of the reagents wore appropriate Personal Protective Equipment (PPE) and the reagents were dispensed in a fume hood. Buffer RLT Plus was generally stable at room temperature for 1 month after addition of β-ME. The date of the addition of β-ME and the 1-month expiration date were marked on the bottle.

Buffer RPE, Buffer AW1, and Buffer AW2 were each supplied as a concentrate by the manufacturer. Before using for the first time, an appropriate volume of 100% ethanol was added, as indicated on the bottle, to obtain a working solution. The solutions were appropriately labeled per the Solutions and Reagent Labeling Standard Operating Procedure (SOP) as described herein. Buffer RLT Plus may form a precipitate during storage. If necessary, the precipitates formed in Buffer RLT Plus were dissolved by warming in a 37° C. water bath until precipitates were dissolved. The Buffer RLT Plus without precipitates was then placed at room temperature. Prolonged incubation in the water bath was not recommended. It was noted that Buffer RLT Plus, Buffer RW1, and Buffer AW1 contained a guanidine salt.

Preparation of Material for Extraction

Before the start of extraction, tubes and columns were labeled with a specimen ID for each sample being processed. Frozen cell pellets were thawed slightly, so that they could be dislodged by flicking the tube. Cell lysates were incubated at 37° C. in a water bath until completely thawed. Prolonged incubation was discouraged because it could compromise RNA integrity. For pelleted cells, the cell pellet was loosened thoroughly by flicking the tube. This was an important step for properly preparing the nucleic acid materials because incomplete loosening of the cell pellet may lead to inefficient lysis and reduced nucleic acid yields. An appropriate volume of Buffer RLT Plus was added, followed by vortexing or pipetting to mix. In general, for <5×10⁵ cells, 350 μL Buffer RLT Plus was added. For 5×10⁵-1×10⁷ cells, 600 μL Buffer RLT Plus was added.
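The choice of lysis buffer volume by cell number can be written as a small lookup. The Python sketch below encodes only the two ranges stated above; the function name and the decision to raise an error outside those ranges are assumptions made for illustration.

```python
def rlt_plus_volume_ul(cell_count):
    """Buffer RLT Plus volume (uL) for a given number of cells, per the ranges above."""
    if cell_count <= 0:
        raise ValueError("cell_count must be positive")
    if cell_count < 5e5:
        return 350  # fewer than 5x10^5 cells
    if cell_count <= 1e7:
        return 600  # 5x10^5 to 1x10^7 cells
    # Hypothetical handling: the text gives no volume above 1x10^7 cells.
    raise ValueError("more than 1e7 cells: no volume is specified")

for n in (1e5, 5e5, 2e6, 1e7):
    print(int(n), rlt_plus_volume_ul(n))
```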

The lysate was homogenized by using QIAshredder. In brief, the lysate was pipetted directly into a QIAshredder spin column placed in a 2 mL collection tube. The lysate was then pipetted and centrifuged for 2 minutes at maximum speed (18,565×g). The homogenized lysate was transferred to an AllPrep DNA spin column placed in a 2 mL collection tube. The lid was closed gently, and the spin column was centrifuged for 30 s at ≥8000×g. After centrifugation, the column membrane was checked for any remaining liquid. If necessary, the centrifugation step was repeated until all liquid had passed through the membrane. The AllPrep DNA spin column was placed in a new 2 ml collection tube and was stored at room temperature or at 4° C. (not in the freezer) for later DNA purification. The flow-through was used for RNA purification.

Total RNA Purification

For purifying RNA, 600 μL of 70% ethanol was added to the flow-through from the previous step and mixed well by pipetting. Up to 700 μL of the sample, including any visible precipitate that had formed, was immediately transferred to an RNeasy spin column placed in a 2 ml collection tube. The lid was closed gently, and the column was centrifuged for 15 s at ≥8000×g. The flow-through was discarded. In the event that the sample volume exceeded 700 μL, successive aliquots were centrifuged in the same RNeasy spin column. The flow-through was discarded after each centrifugation. The collection tube was re-used in the following step.

700 μL Buffer RW1 was added to the RNeasy spin column. The lid was closed gently, and the column was centrifuged for 15 s at ≥8000×g to wash the spin column membrane. The flow-through was discarded. The collection tube was re-used in the following step. 500 μL Buffer RPE was added to the RNeasy spin column. The lid was closed gently, and the column was centrifuged for 15 s at ≥8000×g to wash the spin column membrane. The flow-through was discarded.

In general, if <5×10⁵ cells were processed, 500 μL of 80% ethanol was added to the RNeasy MinElute spin column. The lid was closed gently, and the spin column was centrifuged for 2 min at ≥8000×g to wash the spin column membrane. The collection tube with the flow-through was discarded. The RNeasy MinElute spin column was placed in a new 2 mL collection tube. The lid of the spin column was opened, and the spin column was centrifuged at full speed (18,565×g) for 5 minutes. The collection tube with the flow-through was discarded. The RNeasy MinElute spin column was placed in a new 1.5 mL collection tube. 14 μL RNase-free water was added directly to the center of the spin column membrane. The lid was closed gently, and the column was centrifuged for 1 min at full speed (18,565×g) to elute the RNA. The spin column was discarded, and the 1.5 mL tube with extracted RNA was stored at −80° C. until further processing.

If >5×10⁵ cells were processed, 500 μL Buffer RPE was added to the RNeasy spin column. The lid was closed gently, and the column was centrifuged for 2 min at ≥8000×g to wash the spin column membrane. The RNeasy spin column was placed in a new 2 mL collection tube. The old collection tube with the flow-through was discarded. The collection tube was then centrifuged at full speed (18,565×g) for 1 min. The RNeasy spin column was placed in a new 1.5 mL collection tube. 30-50 μL of RNase-free water was added directly to the spin column membrane. The lid was closed gently, and the column was centrifuged for 1 min at ≥8000×g to elute the RNA.

Genomic DNA Purification

500 μL Buffer AW1 was added to the AllPrep DNA spin column (previouslyplaced in a new 2 ml collection tube and stored at room temperature orat 4° C.). The lid was closed gently, and the spin column wascentrifuged for 15 s at ≥8000×g. The flow-through was discarded. Thespin column was re-used in the following step. 500 μL Buffer AW2 wasadded to the AllPrep DNA spin column. The lid was closed gently, and wascentrifuged for 2 min at full speed (18,565×g) to wash the spin columnmembrane. After centrifugation, the AllPrep DNA spin column wascarefully removed from the collection tube. If the column contacted theflow-through, the collection tube was emptied and the spin column wascentrifuged again for 1 min at full speed.

If <5×10⁵ cells were processed, the AllPrep DNA spin column was placed in a new 1.5 mL collection tube. 50 μL Buffer EB (preheated to 70° C.) was added directly to the spin column membrane, the lid was closed, and the spin column was incubated at room temperature for 2 min. The spin column was centrifuged for 1 min at ≥8000×g to elute the DNA. The addition of Buffer EB and the centrifugation were repeated to elute further DNA. A new 1.5 mL collection tube was used to collect the second DNA eluate, which was then combined with the first eluate. The spin column was discarded, and the 1.5 mL tube with the extracted DNA was stored at 4° C. until further processing.

If >5×10⁵ cells were processed, the AllPrep DNA spin column was placed in a new 1.5 mL collection tube. 50 μL Buffer EB was added directly to the spin column membrane and the lid was closed. The spin column was incubated at room temperature for 1 min and then centrifuged for 1 min at ≥8000×g to elute the DNA. The addition of Buffer EB and the centrifugation were repeated to elute further DNA. A new 1.5 mL collection tube was used to collect the second DNA eluate, which was then combined with the first eluate. The spin column was discarded, and the 1.5 mL tube with the extracted DNA was stored at 4° C. until further processing.

Troubleshooting processes included, but were not limited to, those listed in Table 3.

TABLE 3
Troubleshooting Steps for AllPrep DNA/RNA Procedure

Incident: Clogged AllPrep DNA and RNeasy spin column
Troubleshooting steps: Clogged column can be caused by the following:
  a) Inefficient disruption and/or homogenization
    1. Increase g-force and centrifugation time if necessary.
    2. In subsequent preparations, reduce the amount of starting material and/or increase the homogenization time.
  b) Too much starting material
    1. Reduce the amount of starting material. It is essential to use the correct amount.
  c) Centrifugation temperature too low
    1. The centrifugation temperature should be 20-25° C. Some centrifuges may cool

Incident: Low nucleic acid yield
Troubleshooting steps: Low nucleic acid yield can be caused by the following:
  a) Inefficient disruption and/or homogenization
    1. In subsequent preparations, reduce the amount of starting material and/or increase the homogenization time.
  b) Too much starting material
    1. Reduce the amount of starting material. It is essential to use the correct amount.
  c) RNA still bound to RNeasy spin column membrane
    1. Repeat RNA elution, but incubate the RNeasy spin column on the benchtop for 10 min with RNase-free water before centrifuging.
  d) DNA still bound to AllPrep DNA spin column membrane
    1. Repeat DNA elution, but incubate the AllPrep DNA spin column on the benchtop for 10 minutes with Buffer EB before centrifuging.
  e) Ethanol carryover
    1. During the second wash with Buffer RPE, be sure to centrifuge at ≥8000×g for 2 min at 20-25° C. to dry the RNeasy spin column membrane.
    2. Perform the optional centrifugation to dry the RNeasy spin column membrane if any flow-through is present on the outside of the column.

Incident: DNA contaminated with RNA
Troubleshooting steps: This can be caused by the following:
  a) Lysate applied to the AllPrep DNA spin column contains ethanol
    1. Add ethanol to the lysate after passing the lysate through the AllPrep DNA spin column.
  b) Sample is affecting pH of homogenate
    1. The final homogenate should have a pH of 7. Make sure that the sample is not highly acidic or basic.

Incident: Contamination of RNA with DNA affects downstream applications
Troubleshooting steps: This can be caused by the following:
  a) Cell number too high
    1. For some cell types, the efficiency of DNA binding to the AllPrep DNA spin column may be reduced when processing very high cell numbers.
  b) Tissue has high DNA content
    1. For certain tissues with extremely high DNA content (e.g., thymus), some DNA will pass through the AllPrep DNA spin column. Try using smaller samples. Alternatively, perform DNase digestion on the RNeasy spin column membrane, or perform DNase digestion of the eluted RNA followed by RNA cleanup.

Incident: Low A₂₆₀/A₂₈₀ value in RNA
Troubleshooting steps: Use 10 mM Tris HCl, pH 7.5, not RNase-free water, to dilute the eluate sample before measuring purity.

Incident: RNA degraded
Troubleshooting steps: RNA degradation can be caused by the following:
  a) Inappropriate handling of starting material
    1. Ensure that tissue samples are properly stabilized and stored in RNAlater RNA Stabilization Reagent.
    2. Ensure that frozen tissue was flash-frozen immediately in liquid nitrogen and properly stored at −70° C. Perform the AllPrep DNA/RNA procedure quickly, especially the first few steps.
  b) RNase contamination
    1. Although all AllPrep buffers have been tested and are guaranteed RNase-free, RNases can be introduced during use. Be certain not to introduce any RNases during the AllPrep DNA/RNA procedure or later handling.

Incident: DNA fragmented
Troubleshooting steps: This can happen when homogenization is too vigorous. The length of purified DNA depends strongly on the homogenization conditions. If longer DNA fragments are required, keep the homogenization time to a minimum or use a gentler homogenization method if possible.

Incident: Nucleic acid concentration too low
Troubleshooting steps: The elution volume may be too high. Elute nucleic acids in a smaller volume. Do not use less than 50 μL Buffer EB for the AllPrep DNA spin column, or less than 1 × 30 μL of water for the RNeasy spin column. Although eluting in smaller volumes results in increased nucleic acid concentrations, yields may be reduced.

Incident: Nucleic acids do not perform well in downstream experiments
Troubleshooting steps:
  a) Salt carryover during elution
    1. Ensure that buffers are at 20-30° C.
    2. Ensure that the correct buffer is used for each step of the procedure.
    3. When reusing collection tubes between washing steps, remove residual flow-through from the rim by blotting on clean paper towels.
  b) Ethanol carryover
    1. During the second wash with Buffer RPE, be sure to centrifuge at ≥8000×g for 2 min at 20-25° C. to dry the RNeasy spin column membranes. After centrifugation, carefully remove the column from the collection tube so that the column does not contact the flow-through. Otherwise, carryover of ethanol will occur.
    2. Perform the optional centrifugation to dry the RNeasy spin column membrane if any flow-through is present on the outside of the column.

Example 3: Construction of DNA Libraries for Sequencing

DNA libraries were prepared before performing the downstream sequencing. In brief, Library Construction (LC) consisted of shearing extracted genomic DNA to a pre-determined size (e.g., 200 base pairs) and then preparing the libraries for Hybrid Capture. Fragmented DNA was repaired, and unique molecular barcodes were added to each DNA sample, so that each DNA sample could be identified during sequencing. DNA samples were purified before the barcoded libraries were amplified by Polymerase Chain Reaction (PCR). The DNA samples were then purified again before the amount and quality of each library were assessed using Quality Control (QC) steps performed according to the manufacturer's instructions.

In general, Library Construction consisted of four main steps. First, genomic DNA was sheared to about 200 base pairs using the SureSelect XT HS Enzymatic Fragmentation Kit. The shearing resulted in DNA fragments that needed to undergo blunt end repair. The second step was the repair and dA-tailing of the DNA ends. This step added an “A” base to the 3′ end of a blunt phosphorylated DNA fragment. This treatment created compatible overhangs for the next step of DNA sample preparation. In the third step, specific molecular-barcoded adaptors were ligated to each sample using the “A” base overhang created in the previous step. Adaptors were platform-specific sequences for fragment recognition by the sequencer: for example, the P5 and P7 sequences enabled library fragments to bind to the flow cells of Illumina platforms. The molecular barcode was unique to each sample being run and allowed multiple samples to be subsequently mixed together, with the barcode used to identify each sample at sequencing. The samples were then purified using AMPure XP beads. In the final step, the adaptor-ligated libraries were amplified with PCR, and then purified a second time using AMPure XP beads. Some or all of the procedures were managed and conducted by lab personnel. In the event that quality control related issues arise, lab personnel will notify the provider (e.g., healthcare provider) of the biopsy sample or the extracted DNA.
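The role of the per-sample barcode in a pooled run can be illustrated with a minimal Python sketch. The barcode sequences, sample names, and the `demultiplex` helper below are hypothetical placeholders chosen for illustration; they are not the SureSelect index sequences and not the sequencer's own demultiplexing software.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample map; the real index sequences are not given in the text.
SAMPLE_BY_BARCODE = {
    "ACGTACGT": "sample_A",
    "TGCATGCA": "sample_B",
}


def demultiplex(reads, sample_by_barcode):
    """Group (barcode, sequence) pairs by the sample that the barcode identifies.

    Reads whose barcode is not in the map are collected under "undetermined".
    """
    by_sample = defaultdict(list)
    for barcode, sequence in reads:
        sample = sample_by_barcode.get(barcode, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)


# A pooled run: reads from several samples mixed together.
pooled_reads = [
    ("ACGTACGT", "GATTACA"),
    ("TGCATGCA", "CCCGGGA"),
    ("ACGTACGT", "TTTAGGC"),
    ("NNNNNNNN", "AAAAAAA"),
]
result = demultiplex(pooled_reads, SAMPLE_BY_BARCODE)
```

An exact-match lookup is used here for simplicity; a production pipeline would typically also tolerate a limited number of barcode mismatches.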

Normalization of Samples for Library Construction

Samples were normalized to 10-200 ng in 7 μL, using low TE. The maximum amount of DNA available was used for each sample, within the range provided. The lab personnel then navigated to the normalization spreadsheet, which was located in the Clinical Lab Documents folder in the shared Google Drive. The tab labeled “LC Normalization” was selected. The Sample ID was entered into column A. The measured concentration was entered into column B. The spreadsheet automatically calculated the volumes of sample and low TE required for normalization in columns G and H. If the concentration of a sample was too low to reach the target amount within 7 μL, the spreadsheet calculated a volume of sample >7 μL and a volume of low TE <0 μL. If this occurred, only 7 μL of sample was used, and the sample was not diluted. The volumes calculated in the spreadsheet were used for normalizing the appropriate volumes into a 96 well semi-skirted PCR plate.
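The spreadsheet formulas themselves are not reproduced in this description. As a rough sketch only, and assuming the calculation is a simple mass-over-concentration dilution, the normalization could be expressed in Python as follows. The function name `normalization_volumes` and its parameters are hypothetical.

```python
def normalization_volumes(conc_ng_per_ul, target_ng, final_volume_ul=7.0):
    """Return (sample_ul, diluent_ul) for normalizing one sample.

    Assumed formula: sample volume = target mass / measured concentration,
    with diluent (e.g., low TE) making up the rest of the final volume.
    If the sample is too dilute, the full final volume is taken undiluted.
    """
    if conc_ng_per_ul <= 0:
        raise ValueError("concentration must be positive")
    sample_ul = target_ng / conc_ng_per_ul
    if sample_ul > final_volume_ul:
        # Corresponds to the spreadsheet reporting sample > 7 uL and diluent < 0 uL.
        return final_volume_ul, 0.0
    return sample_ul, final_volume_ul - sample_ul


# 200 ng from a 100 ng/uL sample: 2 uL of sample plus 5 uL of diluent.
concentrated = normalization_volumes(100.0, 200.0)
# 200 ng from a 10 ng/uL sample would need 20 uL, so 7 uL is used undiluted.
dilute = normalization_volumes(10.0, 200.0)
```

The same sketch would apply to a different final volume by passing another value for `final_volume_ul`.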

Enzymatic DNA Shearing

In some embodiments, DNA is fragmented using an endonuclease (e.g., using an enzymatic fragmentation kit from SureSelect). In some embodiments, a SureSelect Fragmentation Buffer and Enzyme were thawed on ice. Fragmentation Buffer was vortexed and spun down before use. A 3 μL Fragmentation master mix for each sample was prepared using 2 μL of 5× SureSelect Fragmentation Buffer mixed with 1 μL SureSelect Fragmentation Enzyme. In some embodiments, larger volumes can be prepared for multiple reactions (e.g., 18 μL of 5× SureSelect Fragmentation Buffer mixed with 9 μL SureSelect Fragmentation Enzyme for 8 reactions including excess).

3 μL Fragmentation master mix was added to each sample well and was mixed by pipetting up and down 20 times. The plate was immediately placed on the thermal cycler on the Enzymatic Fragmentation program (step 1: 37° C. for 15 minutes; step 2: 65° C. for 5 minutes; and step 3: 4° C. on hold).
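A thermal cycler program of this kind can be written down as a small data structure, which makes the timed portion easy to total. The following Python sketch is illustrative only; the `Step` class and `timed_minutes` helper are hypothetical names, and ramp times between temperatures are ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    temperature_c: float
    minutes: Optional[float]  # None marks an open-ended hold


# Enzymatic Fragmentation program as described above.
ENZYMATIC_FRAGMENTATION = [
    Step(37, 15),
    Step(65, 5),
    Step(4, None),  # hold until the plate is removed
]


def timed_minutes(program):
    """Sum the durations of all timed steps, skipping open-ended holds."""
    return sum(step.minutes for step in program if step.minutes is not None)


fragmentation_minutes = timed_minutes(ENZYMATIC_FRAGMENTATION)
```

Under these assumptions the timed portion of the program is 20 minutes before the 4° C. hold.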

Repair and dA-Tail the Fragmented DNA Ends

In some embodiments, fragmented DNA is repaired and dA-tailed, for example using a kit from SureSelect. In some embodiments, reagents were first thawed on ice (e.g., from −20° C. storage) and Agencourt AMPure XP beads were equilibrated to room temperature for at least 30 minutes. End Repair A-Tailing Buffer, Ligation Buffer, End Repair A-Tailing Enzyme Mix, T4 DNA Ligase, and Adaptor Oligo Mix (all from SureSelect XT HS Library Preparation Kit for ILM) were mixed by vortexing.

In some embodiments, a ligation master mix was prepared. The thawed vial of Ligation Buffer was vortexed for 15 seconds at high speed to ensure homogeneity. The Ligation Buffer used in this step was viscous and was mixed thoroughly by vortexing at high speed for 15 seconds before removing an aliquot for use. When combined with other reagents, the Ligation Buffer was mixed well by pipetting up and down 15-20 times using a pipette set to at least 80% of the mixture volume or by vortexing at high speed for 10-20 seconds. A flat-top vortex mixer was used when vortexing strip tubes or plates throughout the protocol. When reagents were mixed by vortexing, the occurrence of adequate mixing was visually verified.

In some embodiments, an appropriate volume of Ligation master mix was prepared by combining reagents as follows: a 25 μL reaction volume for 1 reaction containing 23 μL of Ligation Buffer and 2 μL of T4 DNA Ligase; a 225 μL reaction volume for 8 reactions (including excess) containing 207 μL of Ligation Buffer and 18 μL of T4 DNA Ligase; or a 625 μL reaction volume for 24 reactions (including excess) containing 575 μL of Ligation Buffer and 50 μL of T4 DNA Ligase.
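The multi-reaction volumes listed above are consistent with scaling the single-reaction recipe by the number of reactions plus one extra reaction of excess (9× for 8 reactions, 25× for 24 reactions). That excess rule is inferred from the listed volumes rather than stated, so the following Python sketch treats it as an assumption. The function name `scale_master_mix` is hypothetical.

```python
# Single-reaction Ligation master mix volumes (uL), as listed above.
LIGATION_PER_REACTION_UL = {
    "Ligation Buffer": 23.0,
    "T4 DNA Ligase": 2.0,
}


def scale_master_mix(per_reaction_ul, n_reactions, extra_reactions=1):
    """Scale a per-reaction recipe to n reactions plus an assumed excess.

    extra_reactions=1 reproduces the listed 8- and 24-reaction volumes;
    use extra_reactions=0 for a single reaction with no excess.
    """
    factor = n_reactions + extra_reactions
    return {reagent: volume * factor for reagent, volume in per_reaction_ul.items()}


mix_for_8 = scale_master_mix(LIGATION_PER_REACTION_UL, 8)
mix_for_24 = scale_master_mix(LIGATION_PER_REACTION_UL, 24)
total_for_8 = sum(mix_for_8.values())
```

The same helper could be applied to the other master mixes in this example by passing a different per-reaction recipe.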

The Ligation Buffer was slowly pipetted into a 1.5 mL Eppendorf tube, ensuring that the full volume was dispensed. The T4 DNA Ligase was slowly added, rinsing the enzyme tip with buffer solution after addition, and the solution was mixed well by slowly pipetting up and down 15-20 times, or the tube was sealed and vortexed at high speed for 10-20 seconds. The tube was spun briefly to collect the liquid, and the mix was kept at room temperature for a minimum of 30 minutes, but not more than 45 minutes, before use.

The thawed vial of End Repair-A Tailing Buffer was vortexed for 15 seconds at high speed to ensure homogeneity. The solution was visually inspected. If any solids were observed, vortexing was continued until all solids were dissolved. The appropriate volume of End Repair/dA-Tailing master mix was prepared by combining the following reagents: a 20 μL reaction volume for 1 reaction containing 16 μL of End Repair A-Tailing Buffer and 4 μL of End Repair A-Tailing Enzyme Mix; a 180 μL reaction volume for 8 reactions (including excess) containing 144 μL of End Repair A-Tailing Buffer and 36 μL of End Repair A-Tailing Enzyme Mix; or a 500 μL reaction volume for 24 reactions (including excess) containing 400 μL of End Repair A-Tailing Buffer and 100 μL of End Repair A-Tailing Enzyme Mix.

The End Repair-A Tailing Buffer was slowly pipetted into a 1.5 mL Eppendorf tube, ensuring that the full volume was dispensed. The End Repair-A Tailing Enzyme Mix was slowly added, rinsing the enzyme tip with buffer solution after addition, and the solution was mixed well by pipetting up and down 15-20 times, or the tube was sealed and vortexed at high speed for 5-10 seconds. The tube was spun briefly to collect the liquid and was kept on ice. 20 μL of the End Repair/dA-Tailing master mix was added to each sample well containing approximately 50 μL of fragmented DNA and was mixed by pipetting up and down 15-20 times using a pipette set to 60 μL, or the wells were capped and vortexed at high speed for 5-10 seconds. The samples were briefly spun, and then the plate or strip tube was immediately placed in the thermal cycler and the End Repair/dA-Tailing program was started (step 1: 20° C. for 15 minutes; step 2: 72° C. for 15 minutes; and step 3: 4° C. on hold).

Ligate the Molecular-Barcoded Adaptor

Once the thermal cycler reached the 4° C. Hold step, the samples were transferred to ice while this step was set up. To each end-repaired/dA-tailed DNA sample (approximately 70 μL), 25 μL of the previously prepared Ligation master mix (kept at room temperature) was added, and the sample was mixed by pipetting up and down at least 10 times using a pipette set to 85 μL, or the wells were capped and vortexed at high speed for 5-10 seconds. The samples were briefly spun. 5 μL of Adaptor Oligo Mix (white capped tube) was added to each sample and was mixed by pipetting up and down 15-20 times using a pipette set to 85 μL, or the wells were capped and vortexed at high speed for 5-10 seconds. The Ligation master mix and the Adaptor Oligo Mix were added to the samples in separate addition steps as directed above, mixing after each addition. The samples were briefly spun, and the plate or strip tube was then immediately placed in the thermal cycler and the Ligation program was begun (step 1: 20° C. for 30 minutes; step 2: 4° C. on hold). If the next steps were not performed immediately, the sample wells were sealed and stored overnight at either 4° C. or −20° C.

Purify the Samples Using AMPure XP Beads

The AMPure XP beads were verified to have been held at room temperature for at least 30 minutes before use. The beads were not frozen at any time. 400 μL of 70% ethanol per sample was prepared, plus excess, for use in the following steps. The freshly-prepared 70% ethanol may be used for subsequent purification steps run on the same day. The complete Library Preparation protocol required 0.8 mL of fresh 70% ethanol per sample. The AMPure XP bead suspension was mixed well so that the reagent appeared homogeneous and consistent in color. 80 μL of homogeneous AMPure XP beads were added to each DNA sample (approximately 100 μL) in the PCR plate or strip tube, and the mixture was pipetted up and down 15-20 times, or the wells were capped and vortexed at high speed for 5-10 seconds, to mix. Samples were incubated for 5 minutes at room temperature. The plate or strip tube was put into a magnetic separation device (DynaMag-96 Side Magnet), and the solution was allowed to clear (approximately 5 to 10 minutes). The plate or strip tube was kept in the magnetic stand while the cleared solution from each well was carefully removed and discarded. The beads were not touched while removing the solution. The plate or strip tube was kept in the magnetic stand while 200 μL of freshly-prepared 70% ethanol was dispensed into each sample well. After waiting 1 minute to allow any disturbed beads to settle, the ethanol was removed. The plate or strip tube was kept in the magnetic stand while another 200 μL of freshly-prepared 70% ethanol was dispensed into each sample well. After waiting 1 minute to allow any disturbed beads to settle, the ethanol was removed. The wells were sealed with strip caps, and the samples were then briefly spun to collect the residual ethanol. The plate or strip tube was returned to the magnetic stand for 30 seconds. The residual ethanol was removed with a P20 pipette. The samples were air dried for 5 minutes. The bead pellet was not dried to the point that the pellet appeared cracked during any of the bead drying steps in the protocol. Elution efficiency was significantly decreased when the bead pellet was excessively dried. 35 μL nuclease-free water was added to each sample well. The wells were sealed with strip caps and mixed well on a vortex mixer, and the plate or strip tube was briefly spun to collect the liquid and was incubated for 2 minutes at room temperature. The plate or strip tube was put in the magnetic stand and was left for approximately 5 minutes, until the solution was clear. The cleared supernatant (approximately 34.5 μL) was removed to a fresh PCR plate or strip tube sample well and was kept on ice. The beads could be discarded at this time. It was noted that it may not be possible to recover the entire 34.5 μL supernatant volume at this step. The maximum possible amount of supernatant was transferred for further processing. To maximize recovery, the cleared supernatant was transferred to a fresh well in two rounds of pipetting, using a P20 pipette set at 17.25 μL.
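The ethanol figures in this purification are internally consistent: two washes of 200 μL give 400 μL per sample per purification, and two purifications across the protocol give the stated 0.8 mL per sample. A small Python sketch of that arithmetic follows. The amount of excess is not specified in the text, so the `excess_fraction` parameter and the function name `ethanol_needed_ul` are hypothetical.

```python
def ethanol_needed_ul(n_samples, washes=2, wash_volume_ul=200.0,
                      purifications=1, excess_fraction=0.0):
    """Volume of 70% ethanol (uL) to prepare for bead washes.

    excess_fraction is an illustrative allowance for pipetting loss;
    the source only says "plus excess" without giving a figure.
    """
    base_ul = n_samples * washes * wash_volume_ul * purifications
    return base_ul * (1.0 + excess_fraction)


per_sample_one_purification = ethanol_needed_ul(1)
per_sample_full_protocol = ethanol_needed_ul(1, purifications=2)
eight_samples_with_excess = ethanol_needed_ul(8, excess_fraction=0.25)
```

With these inputs the per-sample totals are 400 μL for one purification and 800 μL (0.8 mL) for the full protocol.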

Amplify the Adaptor-Ligated Library

The following PCR reagents from the SureSelect XT HS Library Preparation Kit for ILM (PrePCR) were thawed, mixed and kept on ice. Herculase II Fusion DNA Polymerase was mixed by pipetting up and down 15-20 times. 5× Herculase II Reaction Buffer was mixed by vortexing. 100 mM dNTP Mix was mixed by vortexing. Forward Primer and SureSelect XT HS Index Primers A01 through H04 were separately mixed by vortexing. The appropriate index assignments for each sample were determined. The SureSelect XT HS Index Primers were provided in single-use aliquots. To avoid cross-contamination of libraries, each vial was discarded after use in one library preparation reaction. Residual volume was not re-used or retained for subsequent experiments.

An appropriate volume of pre-capture PCR reaction mix was prepared on ice as described below and was then mixed well on a vortex mixer. For example, a 13.5 μL reaction volume for 1 reaction contained 10 μL of 5× Herculase II Reaction Buffer, 0.5 μL of 100 mM dNTP Mix, 2 μL of Forward Primer, and 1 μL of Herculase II Fusion DNA Polymerase; a 121.5 μL reaction volume for 8 reactions (including excess) contained 90 μL of 5× Herculase II Reaction Buffer, 4.5 μL of 100 mM dNTP Mix, 18 μL of Forward Primer, and 9 μL of Herculase II Fusion DNA Polymerase; or a 337.5 μL reaction volume for 24 reactions (including excess) contained 250 μL of 5× Herculase II Reaction Buffer, 12.5 μL of 100 mM dNTP Mix, 50 μL of Forward Primer, and 25 μL of Herculase II Fusion DNA Polymerase.

13.5 μL of the PCR reaction mixture was added to each purified DNA library sample (34.5 μL) in the PCR plate wells. 2 μL of the appropriate SureSelect XT HS Index Primer was added to each reaction. The wells were capped and were then vortexed at high speed for 5 seconds. The plate or strip tube was spun briefly to collect the liquid and release any bubbles. Before adding the samples to the thermal cycler, the Pre-Capture PCR program was started according to the conditions below to bring the temperature of the thermal block to 98° C. Once the thermal cycler reached 98° C., the sample plate or strip tube was immediately placed in the thermal block and the following temperature cycling protocol was performed.

Segment   Number of Cycles   Temperature   Time
1         1                  98° C.        2 minutes
2         8                  98° C.        30 seconds
                             60° C.        30 seconds
                             72° C.        1 minute
3         1                  72° C.        5 minutes
4         1                  4° C.         Hold

Purify the Amplified Library with AMPure XP Beads

The AMPure XP beads were verified to have been held at room temperature for at least 30 minutes before use. 400 μL of 70% ethanol per sample was prepared, plus excess. The AMPure XP bead suspension was mixed well, so that the reagent appeared homogeneous and consistent in color. 50 μL of homogeneous AMPure XP beads were added to each amplification reaction in the PCR plate or strip tube, and the mixture was pipetted up and down 15-20 times to mix. Samples were incubated for 5 minutes at room temperature. The plate was put into a magnetic separation device (DynaMag-96 Side Magnet), and the solution was allowed up to 5 minutes to clear. The plate or strip tube was kept on the magnetic stand while the cleared solution from each well was carefully removed and discarded. The beads were not touched while removing the solution. The plate or strip tube was kept in the magnetic stand while 200 μL of freshly-prepared 70% ethanol was dispensed into each sample well. After waiting 1 minute to allow any disturbed beads to settle, the ethanol was removed. The ethanol wash was repeated once. The wells were sealed with strip caps, then the samples were briefly spun to collect the residual ethanol. The plate or strip tube was returned to the magnetic stand for 30 seconds. The residual ethanol was removed with a P20 pipette. The samples were dried by keeping the unsealed plate or strip tube at room temperature for up to 5 minutes, until the residual ethanol had just evaporated. 15 μL nuclease-free water was added to each sample well. The wells were sealed with strip caps and mixed well on a vortex mixer, and the plate or strip tube was briefly spun to collect the liquid and was incubated for 2 minutes at room temperature. The plate or strip tube was put in a magnetic stand and was left for 3 minutes, until the solution was clear. 15 μL of the cleared supernatant was removed to a fresh PCR plate or strip tube sample well and was kept on ice. The new PCR plate containing the libraries was sealed. The beads were discarded. The quality of the sample libraries was checked using an electrophoresis device, for example an automated electrophoresis device (e.g., a TapeStation System available from Agilent, www.agilent.com) and a spectrophotometer, for example a small volume full-spectrum, UV-visible spectrophotometer (e.g., Nanodrop spectrophotometer available from ThermoFisher Scientific, www.thermofisher.com), or the plate was stored at −20° C.

Resources from the manufacturers, including Agilent SureSelect XT HS Target Enrichment System for Illumina Paired-End Multiplexed Sequencing Library protocol and Agilent SureSelect XT HS and XT Low Input Enzymatic Fragmentation Kit protocol, were incorporated by reference herein.

Example 4: Hybridization-Capture and Target Enrichment of DNA Libraries

Hybridization-Capture based target enrichment was used directly after Library Construction described in Example 3. This protocol described the steps to hybridize the prepared gDNA libraries with target-specific capture probes. Target enrichment worked by mixing target-specific biotinylated probes with the DNA Library. The probes bound to the targets, which were then isolated by streptavidin-coated magnetic bead pulldown, leaving uncaptured DNA (the regions of the genome not of interest) behind. The steps to hybridize the prepared DNA libraries with a target-specific capture library were provided. After library preparation, the libraries were denatured and biotin-labeled probes specific to targeted regions were used for hybridization. The pool was enriched for regions of interest by adding streptavidin-coated beads that bound to the biotinylated probes. DNA fragments bound to the streptavidin-coated beads via biotinylated probes were magnetically pulled down from the solution. The enriched fragments were then eluted from the beads. Each DNA library sample was hybridized and captured individually. Some or all of the procedures were managed and conducted by lab personnel. In the event that quality control related issues arise, lab personnel will notify the provider (e.g., healthcare provider) of the biopsy sample or the extracted DNA. As a general work procedure, before beginning the procedure, work surfaces and pipettes were thoroughly disinfected by wiping down with 10% bleach, followed by 70% ethanol. The same cleaning process was followed after completion of the work procedure.

Normalization of Samples for Hybrid Capture

Samples were normalized to 500-1000 ng in 12 μL, using nuclease-free water. The maximum amount of DNA available was used for each sample, within the range provided. The lab personnel then navigated to the normalization spreadsheet, located in the Clinical Lab Documents folder in the shared Google Drive. The tab labeled “HC Normalization” was selected. The Sample ID was entered into column A and the measured concentration was entered into column B. The spreadsheet automatically calculated the volumes of sample and nuclease-free water required for normalization in columns G and H. If the concentration of a sample was too low to reach the target amount within 12 μL, the spreadsheet calculated a volume of sample >12 μL and a volume of nuclease-free water <0 μL. If this occurred, only 12 μL of sample was used, and the sample was not diluted. Using the volumes calculated in the spreadsheet, appropriate volumes were normalized into a 96 well semi-skirted PCR plate.

Hybridize DNA Samples to the Capture Library

In some embodiments, the component reagents for hybridization using a SureSelect kit were thawed, according to the thawing conditions described below. Each reagent was vortexed to mix, then tubes were spun briefly to collect the liquid.

To each DNA library sample well, 5 μl of SureSelect XT HS and XT Low Input Blocker Mix (previously thawed on ice) was added. The wells were capped and then vortexed at high speed for 5 seconds. The plate was spun briefly to collect the liquid and release any bubbles. The sealed sample plates were transferred to the thermal cycler and the Hybridization program was started. The thermal cycler was programmed to pause during Segment 3 of the Hybridization program to allow additional reagents to be added to the Hybridization wells, as described in the next sections. During Segments 1 and 2 of the thermal cycling program, the additional reagents were prepared as described in the next section. If needed, these steps could be finished after the thermal cycler program paused in Segment 3. A 25% solution of SureSelect RNase Block (e.g., previously thawed on ice) was prepared and mixed well by vortexing, and the mix was briefly centrifuged and then kept on ice.

Further, a Capture Library Hybridization Mix was prepared as follows for one or more reactions. For example, a 13 μL reaction volume for 1 reaction contained 2 μL of 25% RNase Block solution, 5 μL of Capture Library ≥3 Mb (e.g., previously thawed on ice), and 6 μL of SureSelect Fast Hybridization Buffer (e.g., previously thawed and kept at room temperature); a 117 μL reaction volume for 8 reactions (including excess) contained 18 μL of 25% RNase Block solution, 45 μL of Capture Library ≥3 Mb, and 54 μL of SureSelect Fast Hybridization Buffer; or a 325 μL reaction volume for 24 reactions (including excess) contained 50 μL of 25% RNase Block solution, 125 μL of Capture Library ≥3 Mb, and 150 μL of SureSelect Fast Hybridization Buffer.

The listed reagents were combined at room temperature, mixed well by vortexing at high speed for 5 seconds, and then spun down briefly. The mixture was prepared just before the thermal cycler paused in Segment 3. The mixture was kept at room temperature only briefly, until it was added to the DNA samples on the cycler. Solutions containing the Capture Library were not kept at room temperature for extended periods.

The thermal cycler was paused at Segment 3 of the Hybridization program (1 minute at 65° C.). With the cycler paused, and while keeping the DNA+Blocker samples in the cycler, 13 μl of the room-temperature Capture Library Hybridization Mix was transferred to each sample well and was mixed well by pipetting up and down slowly 10 times. The wells were sealed with fresh domed strip caps, and all wells were checked to make sure that they were completely sealed. A compression pad was placed on the plate to prevent evaporation during hybridization. The Play button was pushed to resume the thermal cycling program to allow hybridization of the prepared DNA samples to the Capture Library. Wells were adequately sealed to minimize evaporation, which could negatively impact results.
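The volume in each hybridization well can be tallied from the additions described so far. The Python sketch below assumes a 12 μL normalized library input (taken from the normalization step for Hybrid Capture) and ignores evaporation; the list name and the `running_volumes` helper are hypothetical.

```python
# Additions to one hybridization well, in order (volumes in uL).
# The 12 uL starting volume is an assumption based on the normalization step.
HYBRIDIZATION_ADDITIONS_UL = [
    ("normalized DNA library", 12.0),
    ("Blocker Mix", 5.0),
    ("Capture Library Hybridization Mix", 13.0),
]


def running_volumes(additions):
    """Return (reagent, cumulative volume) after each addition."""
    total = 0.0
    cumulative = []
    for reagent, volume_ul in additions:
        total += volume_ul
        cumulative.append((reagent, total))
    return cumulative


tally = running_volumes(HYBRIDIZATION_ADDITIONS_UL)
final_volume_ul = tally[-1][1]
```

Under these assumptions the well holds 17 μL after the Blocker Mix and 30 μL once the Capture Library Hybridization Mix has been added.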

Prepare Streptavidin-Coated Magnetic Beads

In some embodiments, the bead preparation steps began approximately one hour after starting hybridization. Reagents for capture from the SureSelect XT HS Target Enrichment Kit ILM Hyb Module included the SureSelect Binding Buffer, the SureSelect Wash Buffers 1 and 2 (e.g., all kept at room temperature), and Dynabeads MyOne Streptavidin T1 (e.g., stored at 2° C. to 8° C.). Dynabeads MyOne Streptavidin T1 magnetic beads were brought to room temperature for at least 30 minutes. Because the magnetic beads settled during storage, the Dynabeads MyOne Streptavidin T1 magnetic beads were vigorously resuspended on a vortex mixer. For each hybridization sample, 50 μl of the resuspended beads was added to wells of a fresh PCR plate. The beads were washed by adding 200 μl of SureSelect Binding Buffer and mixing by pipetting up and down 20 times or capping the wells and vortexing at high speed for 5-10 seconds. The plate was put into a magnetic separator device, and the solution was allowed to clear (approximately 5 minutes). The supernatant was removed and discarded. The wash steps were repeated two more times, for a total of three washes. The beads were resuspended in 200 μl of SureSelect Binding Buffer.

Capture the Hybridized DNA Using Streptavidin-Coated Beads

After the hybridization step was complete on the thermal cycler, the samples were transferred to room temperature. The entire volume (approximately 30 μl) of each hybridization mixture was immediately transferred to the wells containing 200 μl of washed streptavidin beads using a multichannel pipette. The mixture was pipetted up and down 5-8 times to mix, and then the wells were sealed with fresh caps. The capture plate was incubated on a 96-well plate mixer and was mixed at 1500 rpm for 30 minutes at room temperature. Proper mixing of the samples in the wells was verified. During the 30-minute incubation for capture, SureSelect Wash Buffer 2 was pre-warmed in the thermal cycler at 70° C. by placing 200 μL aliquots of Wash Buffer 2 in wells of a fresh 96-well plate, with 6 wells of buffer aliquoted for each DNA sample in the run.

The wells were capped and then incubated in the thermal cycler, with the heated lid ON, held at 70° C. until time for use. When the 30-minute sample incubation period was complete, the samples were briefly spun to collect the liquid. The plate was put in a magnetic separator to collect the beads, the solution was allowed to clear, and then the supernatant was removed and discarded. The beads were resuspended in 200 μl of SureSelect Wash Buffer 1 and were mixed by pipetting up and down 15-20 times, until the beads were fully resuspended. The plate was put in the magnetic separator, the solution was allowed to clear (approximately 1 minute), and then the supernatant was removed and discarded. The plate was removed from the magnetic separator and was transferred to room temperature. The beads were washed with Wash Buffer 2, using the steps below: 1) the beads were resuspended in 200 μl of 70° C. pre-warmed Wash Buffer 2; 2) the suspension was pipetted up and down 15-20 times, until the beads were fully resuspended; 3) the samples were incubated for 5 minutes at 70° C. on the thermal cycler with the heated lid on; 4) after the 5-minute incubation, the plate was put in the magnetic separator at room temperature; 5) the solution was allowed to clear (approximately 1 minute), then the supernatant was removed and discarded; and 6) the wash steps were repeated five more times for a total of 6 washes.
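Because each DNA sample consumes six 200 μL aliquots of pre-warmed Wash Buffer 2, the number of wells needed on the pre-warming plate grows quickly with the run size. The following Python sketch illustrates that bookkeeping; the function name `wash_buffer_2_plan` and the single-plate limit check are illustrative assumptions, not part of the described procedure.

```python
WELLS_PER_PLATE = 96


def wash_buffer_2_plan(n_samples, washes_per_sample=6, aliquot_ul=200.0):
    """Return (wells_needed, total_buffer_ul) for pre-warming Wash Buffer 2.

    Raises ValueError if the aliquots would not fit on one 96-well plate,
    which is an assumed constraint for this sketch.
    """
    wells_needed = n_samples * washes_per_sample
    if wells_needed > WELLS_PER_PLATE:
        raise ValueError("aliquots exceed one 96-well plate")
    return wells_needed, wells_needed * aliquot_ul


plan_for_8 = wash_buffer_2_plan(8)
plan_for_16 = wash_buffer_2_plan(16)
```

Under the single-plate assumption, 16 samples is the largest run whose aliquots fit on one plate.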

After verifying that all wash buffer was removed, 25 μl of nuclease-free water was added to each sample well and then pipetted up and down 8 times to resuspend the beads. The plate was sealed and the samples were kept on ice until they were used later. Captured DNA was retained on the streptavidin beads during the post-capture amplification step.

Amplify the Captured Libraries

In some embodiments, reagents for post-capture PCR amplification were thawed and kept on ice, and included Herculase II Fusion DNA Polymerase (mixed by pipetting up and down), 5× Herculase II Reaction Buffer, 100 mM dNTP Mix, and SureSelect Post-Capture Primer Mix (e.g., all mixed by vortexing).

The Post-Capture PCR thermal cycler program was started to preheat the cycler. Appropriate volumes of PCR reaction mix were prepared, on ice, and mixed well on a vortex mixer. For example, a 25 μL reaction volume for 1 reaction contained 12.5 μL of nuclease-free water, 10 μL of 5× Herculase II Reaction Buffer, 1 μL of Herculase II Fusion DNA Polymerase, 0.5 μL of 100 mM dNTP Mix, and 1 μL of SureSelect Post-Capture Primer Mix.
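
As an illustration of preparing "appropriate volumes" for several reactions, the per-reaction volumes listed above can be scaled as in the following minimal sketch. The function name `master_mix` and the `overage` parameter (a pipetting excess) are assumptions introduced for illustration only and are not specified by the protocol.

```python
# Per-reaction volumes in microliters, taken from the 25 uL example above.
PER_REACTION_UL = {
    "nuclease-free water": 12.5,
    "5x Herculase II Reaction Buffer": 10.0,
    "Herculase II Fusion DNA Polymerase": 1.0,
    "100 mM dNTP Mix": 0.5,
    "SureSelect Post-Capture Primer Mix": 1.0,
}


def master_mix(n_reactions, overage=0.10):
    """Scale each reagent volume to n_reactions.

    `overage` is a hypothetical fractional excess (10% by default) to
    compensate for pipetting loss; it is an assumption, not a protocol value.
    """
    if n_reactions < 1:
        raise ValueError("n_reactions must be at least 1")
    scale = n_reactions * (1.0 + overage)
    return {name: round(vol * scale, 2) for name, vol in PER_REACTION_UL.items()}


if __name__ == "__main__":
    mix = master_mix(8)
    for name, vol in mix.items():
        print(f"{name}: {vol} uL")
    print("total:", round(sum(mix.values()), 2), "uL")
```

Setting `overage=0.0` reproduces exact multiples of the 25 μL recipe.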

For each reaction, 25 μl of the PCR reaction mix was added to each sample well containing bead-bound target-enriched DNA. The PCR reactions were mixed well by pipetting up and down until the bead suspension was homogeneous. Splashing samples onto well walls was avoided and the samples were not spun at this step. The plate was sealed well. The plate was placed in the thermal cycler and a compression pad was placed on the plate to prevent evaporation. The Play button was pressed to resume the Post-Capture PCR thermal cycler program. When the PCR amplification program was complete, the plate was spun briefly. The streptavidin-coated beads were removed by placing the plate on the magnetic stand at room temperature. The solution was allowed to clear (approximately 2 minutes), and then each supernatant (approximately 50 μl) was transferred to wells of a fresh plate. The beads could be discarded at this time.

Purify the Amplified Capture Libraries Using AMPure XP Beads

In brief, the AMPure XP beads were allowed to come to room temperature for at least 30 minutes. The beads were not frozen at any time. 400 μl of fresh 70% ethanol per sample was prepared for later use as described herein. The AMPure XP bead suspension was mixed well so that the suspension appeared homogeneous and consistent in color. 50 μl of the homogeneous AMPure XP bead suspension was added to each amplified DNA sample (approximately 50 μl) in the PCR plate and mixed well by pipetting up and down 15-20 times, or the wells were capped and vortexed at high speed for 5-10 seconds. It was ensured that the beads were in a homogeneous suspension in the sample wells, such that each well had a uniform color with no layers of beads or clear liquid present. The samples were then incubated for 5 minutes at room temperature. The plate was put on the magnetic stand at room temperature until the solution cleared (approximately 3 to 5 minutes). While keeping the plate on the magnetic stand, the cleared solution from each well was carefully removed and discarded. The beads were not disturbed while removing the solution. The plate was kept on the magnetic stand while 200 μl of freshly prepared 70% ethanol was dispensed into each sample well. After waiting 1 minute to allow any disturbed beads to settle, the ethanol was removed.

The ethanol wash was repeated once for a total of two washes. All of the ethanol at each wash step was carefully removed. The wells were then sealed with strip caps and briefly spun to collect the residual ethanol. The plate was returned to the magnetic stand for 30 seconds. The residual ethanol was removed with a P20 pipette. Next, the samples were dried by keeping them at room temperature until the wells were dry (about 5-10 minutes). Care was taken that the bead pellet did not start to crack, as cracking was a sign of over-drying. 25 μl of nuclease-free water was then added to each sample well. The sample wells were sealed, mixed well on a vortex mixer, and then briefly spun to collect the liquid without pelleting the beads. The wells were incubated for 2 minutes at room temperature. The plate was put on the magnetic stand and left until the solution was clear. A new PCR plate was labeled with the Run ID. The cleared supernatant (approximately 25 μl) was transferred to the fresh plate. The beads could be discarded at this time. Then, the quality of the captured libraries was checked by qPCR using the Roche LightCycler SOP, or the libraries were stored at −20° C.

Example 5: Construction of RNA Libraries for Sequencing

RNA libraries were prepared before performing the downstream sequencing. In brief, this protocol explained how to convert cDNA, synthesized from mRNA in a total RNA sample, into a library of DNA for hybridization capture prior to sequencing. The reagents provided in an Illumina TruSeq Stranded mRNA library prep workflow were used.

The process involved the adenylation of the 3′ ends of blunt-ended fragments by the addition of one adenine nucleotide. This prevented them from ligating to each other during the adapter ligation reaction. One corresponding thymine nucleotide on the 3′ end of the adapter provided a complementary overhang for ligating the adapter to the fragment. This strategy ensured a low rate of chimera (concatenated template) formation. In the next step, multiple indexing adapters were ligated to the ends of the ds cDNA fragments, which prepared them for hybridization onto a flow cell. Fragments with no adapters did not hybridize to surface-bound primers on the flow cell. Fragments with an adapter on one end could hybridize to surface-bound primers, but did not form clusters. The DNA fragment enrichment process used PCR to selectively enrich those DNA fragments that had adapter molecules on both ends and to amplify the amount of DNA in the library. PCR was performed with a PCR Primer Cocktail that annealed to the ends of the adapters. RNA library construction consisted of three steps as described herein. The steps introduced above were followed by a library clean-up and library quantitation by qPCR using a nucleic acid amplification device (e.g., a PCR system), for example a real-time PCR system (e.g., a LightCycler Instrument 480 available from Roche Life Science, www.lifescience.roche.com). Accurate quantification by qPCR made it possible to create optimum cluster densities across all four lanes of the flow cell.

Some or all of the procedures were managed and conducted by lab personnel. In the event that quality control related issues arise, lab personnel will notify the provider (e.g., healthcare provider) of the biopsy sample or the extracted RNA.

Adenylate 3′ Ends

The reagents were prepared according to the conditions below. In brief, 2.5 μL Resuspension Buffer was added to each well containing sample (Resuspension Buffer is typically stored at −25° C. to −15° C. and left to stand for 30 minutes to bring it to room temperature before use). 12.5 μL A-Tailing Mix was added to each well, and then mixed thoroughly by pipetting up and down 10 times (A-Tailing Mix is typically stored at −25° C. to −15° C. and thawed at room temperature). The plate was sealed and centrifuged at 280×g for 1 minute. The plate was incubated on the ATAIL70 program of the thermal cycler. The ATAIL70 program consisted of the following steps: 1) preheat lid: 100° C. hold time, 2) step 1: 37° C. for 30 minutes, 3) step 2: 70° C. for 5 minutes, and 4) step 3: 4° C. hold time. The plate was then centrifuged at 280×g for 1 minute.

Ligate Adapters

The reagents were prepared according to the conditions below. In brief, the RNA Adapter tubes were centrifuged at 600×g for 5 seconds. Ligation Mix was removed from −25° C. to −15° C. storage. The following reagents were added in the order listed to each well: 1) 2.5 μL Resuspension Buffer, 2) 2.5 μL Ligation Mix, and 3) 2.5 μL RNA Adapter Indexes. The reagents were then mixed thoroughly by pipetting up and down 10 times and centrifuged at 280×g for 1 minute. The plate was placed on the thermal cycler and the LIG program was run. The LIG program was as follows: 1) preheat lid: 100° C. hold time, 2) step 1: 30° C. for 10 minutes, and 3) step 2: 4° C. hold time. The Stop Ligation Buffer was centrifuged at 600×g for 5 seconds. Once the LIG program stopped, the plate was removed from the thermal cycler and 5 μL Stop Ligation Buffer was added to each well and mixed thoroughly by pipetting up and down. The plate was then centrifuged at 280×g for 1 minute. Ligation Mix was not removed from storage until instructed to do so in the procedure. RNA Adapter Indexes are typically stored at −25° C. to −15° C. and thawed at room temperature for 10 minutes prior to use. Resuspension Buffer and AMPure XP Beads are typically stored at 2° C. to 8° C. and left to stand for 30 minutes to bring them to room temperature before use. Stop Ligation Buffer is typically stored at −25° C. to −15° C. and thawed at room temperature before use.

Clean Up Ligated Fragments

In brief, 42 μL AMPure XP beads were added to each well and mixed thoroughly by pipetting up and down before being incubated at room temperature for 15 minutes. After incubation, the mix was centrifuged at 280×g for 1 minute. The wells were then placed on a magnetic stand and left until the liquid was clear (about 2-5 minutes). While waiting for the liquid to clear, fresh 80% EtOH was made for use in the two wash steps. After the liquid cleared, all supernatant was removed and discarded from each well, and each well was washed two times as follows: 1) 200 μL fresh 80% EtOH was added to each well, 2) the wells were incubated on the magnetic stand for 30 seconds, and 3) all supernatant was removed and discarded from each well. A 20 μL pipette was used to remove residual EtOH from each well.

The wells were air-dried on the magnetic stand for 5 minutes. Care was taken that the bead pellet did not start to crack, as cracking was a sign of over-drying. The wells were then removed from the magnetic stand. 52.5 μL Resuspension Buffer was added to each well and mixed thoroughly by pipetting up and down before incubating at room temperature for 2 minutes. The wells were centrifuged at 280×g for 1 minute. The wells were placed on a magnetic stand and left until the liquid was clear (about 2-5 minutes). 50 μL supernatant was transferred to the corresponding well of a newly labeled PCR plate. 50 μL AMPure XP beads were added to the plate and mixed thoroughly by pipetting up and down before incubating at room temperature for 15 minutes. The plate was centrifuged at 280×g for 1 minute. The plate was placed on a magnetic stand and left until the liquid was clear (2-5 minutes). All supernatant was removed and discarded from each well. Each well was washed two times as follows: 1) 200 μL fresh 80% EtOH was added to each well, 2) the plate was incubated on the magnetic stand for 30 seconds, and 3) all supernatant was removed and discarded from each well.

After that, a 20 μL pipette was used to remove residual EtOH from each well. The wells were air-dried on the magnetic stand for 5 minutes. Care was taken that the bead pellet did not start to crack, as this would be a sign of over-drying. The wells were then removed from the magnetic stand. 22.5 μL Resuspension Buffer was added to each well and mixed thoroughly by pipetting up and down before incubating at room temperature for 2 minutes. The wells were then centrifuged at 280×g for 1 minute. The wells were placed on a magnetic stand and left until the liquid was clear (2-5 minutes). 20 μL supernatant was transferred to the corresponding well of a newly labeled PCR plate. The beads were not disturbed during the process. This step was a safe stopping point: the plate could be sealed and stored at −25° C. to −15° C. for up to 7 days.

Enrich DNA Fragments

The reagents were prepared according to the conditions below. In brief, the PCR plate was placed on ice and 5 μL PCR Primer Cocktail was added to each well. 25 μL PCR Master Mix was added to each well, and then mixed thoroughly by pipetting up and down 10 times. The sample wells were sealed and centrifuged at 280×g for 1 minute. The sample wells were placed on the thermal cycler and the mRNA PCR program was performed. The mRNA PCR program was as follows: 1) preheat lid: 100° C. hold time, 2) step 1: 98° C. for 30 seconds, 3) step 2 (15 cycles): 98° C. for 10 seconds, 60° C. for 30 seconds, and 72° C. for 30 seconds, 4) step 3: 72° C. for 5 minutes, and 5) step 4: 4° C. hold time.
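
The mRNA PCR program described above can be written down as a small data structure, which also makes it easy to estimate the programmed incubation time. The following is a minimal sketch; the names `MRNA_PCR` and `programmed_seconds` are hypothetical, and the estimate ignores lid preheating, temperature ramping, and the final 4° C. hold, none of which are timed in the text.

```python
# Each entry is (number_of_cycles, [(temperature_in_C, seconds), ...]),
# following the mRNA PCR program listed above (final 4 C hold omitted).
MRNA_PCR = [
    (1, [(98, 30)]),                      # step 1: initial denaturation
    (15, [(98, 10), (60, 30), (72, 30)]), # step 2: 15 amplification cycles
    (1, [(72, 300)]),                     # step 3: final extension
]


def programmed_seconds(program):
    """Sum the programmed incubation time, excluding ramping and holds."""
    return sum(
        cycles * sum(seconds for _temp, seconds in steps)
        for cycles, steps in program
    )


if __name__ == "__main__":
    total = programmed_seconds(MRNA_PCR)
    print(f"{total} s ({total / 60:.0f} min) of programmed incubation")
```

Under these assumptions the program accounts for 1380 seconds (23 minutes) of incubation; actual run time on an instrument would be longer because of ramping.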

Once the program was complete, the plate was centrifuged at 280×g for 1 minute. AMPure XP beads were mixed by thorough vortexing, and 50 μL was added to each well and mixed thoroughly by pipetting up and down 10 times before incubating at room temperature for 15 minutes. The sample wells were centrifuged at 280×g for 1 minute. The wells were placed on a magnetic stand and left until the liquid was clear (2-5 minutes). All supernatant was removed and discarded from each well. The wells were washed two times as follows: 1) 200 μL fresh 80% EtOH was added to each well, 2) the wells were incubated on the magnetic stand for 30 seconds, and 3) all supernatant was removed and discarded from each well. A 20 μL pipette was used to remove residual EtOH from each well. The wells were air-dried on the magnetic stand for 5 minutes. Care was taken that the bead pellet did not start to crack, which was a sign of over-drying. The wells were then removed from the magnetic stand. 32.5 μL Resuspension Buffer was added to each well and mixed thoroughly by pipetting up and down 10 times before incubating at room temperature for 2 minutes. The wells were centrifuged at 280×g for 1 minute. The wells were placed on a magnetic stand and left until the liquid was clear (2-5 minutes). 30 μL supernatant was transferred to the corresponding well of a newly labeled PCR plate. The lab personnel then proceeded with library QC using a nucleic acid amplification device (e.g., a PCR system), for example a real-time PCR system (e.g., a LightCycler Instrument 480 available from Roche, www.lifescience.roche.com), or the plate was sealed and stored at −20° C. for up to 7 days. PCR Primer Cocktail is typically stored at −25° C. to −15° C. and thawed at room temperature before use. PCR Master Mix is typically stored at −25° C. to −15° C. and thawed on ice before use. Resuspension Buffer and AMPure XP Beads are typically stored at 2° C. to 8° C. and left to stand for 30 minutes to bring them to room temperature before use.

Resources from the manufacturers, including the TruSeq Stranded mRNA Reference Guide, are incorporated by reference herein.

Example 6: Quality Control Concerning the DNA/RNA Library Preparation Process Based on DNA and RNA from Fresh Frozen Tissue Library Sequencing

For library preparation, extracts of DNA and RNA from the tissue were obtained by using the AllPrep DNA/RNA Mini Kit. Any suitable extraction kit known in the art could also be used. Library construction from purified DNA was carried out with Agilent SureSelect XT HS and Agilent SureSelect Human All Exon V7 exome kits. Library construction from purified RNA was carried out with the Illumina TruSeq mRNA stranded kit. Quality control (QC) measurements were carried out after each stage of library preparation. All QC metrics were obtained with a spectrophotometer, for example a small volume full-spectrum, UV-visible spectrophotometer (e.g., Nanodrop spectrophotometer available from ThermoFisher Scientific, www.thermofisher.com), a fluorometer, for example for quantification of DNA or RNA (e.g., a Qubit Flex fluorometer available from ThermoFisher Scientific, www.thermofisher.com), a nucleic acid amplification device (e.g., a PCR system), for example a real-time PCR system (e.g., a LightCycler Instrument 480 II available from Roche, www.lifescience.roche.com), and an electrophoresis device, for example an automated electrophoresis device (e.g., an Agilent TapeStation System 4150 available from Agilent, www.agilent.com). The measurements covered purity, concentration, and size of DNA/RNA fragments.

QC metrics during the next generation sequencing experiment were a set of individual parameters that evaluated the overall quality of the data set. The following metrics were evaluated: cluster density, percentage of clusters passing filters that were assigned to an index, quality score of 30 (Q30), and error rate. The next stage was to estimate quality metrics used in the bioinformatics pipeline (Bioinformatics QC). It was divided into two processes: WES (DNA sequencing) and RNA sequencing (RNA-seq). The following metrics were taken into account: tumor purity, depth of coverage, alignment rate, base call quality scores or Phred score, uniformity of coverage, GC content, mapping quality, duplication rate, insert size, contamination, SNP concordance, HLA allele concordance, and ADA genomes contamination.

In general, the protocol described in the present example provides the metrics for QC methods used in the library preparation stages and in the bioinformatic pipeline of whole-exome sequencing (WES) and RNA-seq analysis. The bioinformatic pipeline was divided into two components: QC from the sequencing platform and Bioinformatics QC. For both the sequencing platform and Bioinformatics QC, a table with estimated metrics for WES and RNA-seq data was provided. If the metrics fell within the targeted range (the most desirable values) or within the acceptable range, the sample data could be used. If the values of the obtained metrics fell outside of the acceptable range, the corresponding sample was considered to have poor quality.
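
The targeted/acceptable/out-of-range logic described above can be sketched as follows. This is only an illustration: the function and dictionary names are hypothetical, and the two thresholds shown (% ≥ Q30 and error rate) are the values listed for the sequencing run in Table 7 of this Example.

```python
def classify(value, is_targeted, is_acceptable):
    """Return 'targeted', 'acceptable', or 'out of range' for one metric."""
    if is_targeted(value):
        return "targeted"
    if is_acceptable(value):
        return "acceptable"
    return "out of range"


# Hypothetical metric names; thresholds as listed in Table 7.
RUN_QC = {
    "percent_q30": (lambda v: v > 85, lambda v: v > 75),
    "error_rate_percent": (lambda v: v < 0.7, lambda v: v < 1.0),
}


def classify_run(metrics):
    """Classify every known metric present in `metrics`."""
    return {
        name: classify(metrics[name], *RUN_QC[name])
        for name in RUN_QC
        if name in metrics
    }


if __name__ == "__main__":
    result = classify_run({"percent_q30": 80.0, "error_rate_percent": 1.2})
    for name, status in result.items():
        print(f"{name}: {status}")
```

A sample with any metric classified as "out of range" would, under the rule above, be considered to have poor quality.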

In any given experiment or project, one or more of the quality control processes can be used. In some experiments or projects, all of the quality control processes can be used. Some or all of the procedures were managed and conducted by lab personnel. In the event that quality control related issues arise, lab personnel will notify the provider (e.g., healthcare provider) of the biopsy sample or the extracted DNA/RNA.

Quality Control Steps in the Process of DNA and RNA Library Preparation

Table 4-Table 6 describe embodiments of DNA and RNA library preparation including one or more quality control steps at each phase: extraction, library construction, and hybridization and capture. By measuring the concentrations of extracts, primary libraries, and libraries after hybridization and capture, the quality of the products from the samples was determined. Based on the determination of the quality of DNA or RNA in the tested sample, a decision was made to either go forward to the next step or to repeat the processes. A spectrophotometer, for example a small volume full-spectrum, UV-visible spectrophotometer (e.g., Nanodrop spectrophotometer available from ThermoFisher Scientific, www.thermofisher.com), a fluorometer, for example for quantification of DNA or RNA (e.g., a Qubit fluorometer available from ThermoFisher Scientific, www.thermofisher.com), an electrophoresis device, for example an automated electrophoresis device (e.g., a TapeStation System available from Agilent, www.agilent.com), and a nucleic acid amplification device (e.g., a PCR system), for example a real-time PCR system (e.g., a LightCycler Instrument available from Roche, www.lifescience.roche.com), were used for measuring purity, concentration, and size of DNA/RNA fragments. The acceptable and the targeted ranges of DNA and RNA for the respective devices for each phase are indicated in the tables below. The results of the quality control can be confirmed by performing electrophoresis. The results of the quality control can be confirmed by the determination of the size distribution of the nucleic acids.

The present Example provides troubleshooting protocols. For example, if an additional peak at 150 bp was observed on the electropherogram from the TapeStation at the LC or HC stage, an additional step of washing with AMPure beads (library to beads 1:0.8 volume ratio) was performed.

TABLE 4 Quality Control: Extraction of DNA or RNA for Library Preparation

Stage: Extraction. Values are given as Acceptable | Target.
Total amount for next step. DNA: 20-200 ng | 30-200 ng. RNA: 100-1000 ng | >200-1000 ng.
Nanodrop, concentration. DNA: >4 ng/ul | >5.5 ng/ul. RNA: >3 ng/ul | >5 ng/ul.
Nanodrop, 260/280. DNA: >1.5 | 1.8-2.0. RNA: >1.5 | 1.8-2.0.
Nanodrop, 260/230. DNA: >1.5 | 2.0-2.2. RNA: >1.5 | 2.0-2.2.
Qubit, concentration. DNA: >3 ng/ul | >4.5 ng/ul. RNA: >2 ng/ul | >4 ng/ul.
Tapestation, concentration. DNA: >2.5 ng/ul | >4 ng/ul. RNA: >1.5 ng/ul | >3 ng/ul.
Tapestation, RIN. RNA: >5 | >8.
Comments: If the metrics are out of range, the extraction process has to be repeated.

TABLE 5 Quality Control: DNA or RNA for Library Construction

Stage: Library construction. Values are given as Acceptable | Target.
Total amount for next step. DNA: 200-1000 ng | 500-1000 ng. RNA: 0.5-4 nmol/l | 0.5-4 nmol/l.
Qubit, concentration. DNA: >17 ng/ul | >42 ng/ul. RNA: >0.1 ng/ul | >0.1 ng/ul.
Tapestation, concentration. DNA: >15 ng/ul | >40 ng/ul. RNA: >0.1 ng/ul | >0.1 ng/ul.
Tapestation, concentration: >0.5 nmol/l | >0.5 nmol/l.
Tapestation, average: 370-440 | 370-440.
LightCycler, concentration. DNA: Not required. RNA: >0.5 nmol/l | >0.5 nmol/l.
Comments: If the metrics are out of range, extraction process can be repeated. If an electropherogram with an additional peak at 150 bp is observed, the troubleshooting as described in the present Example would provide guidance.

TABLE 6 Quality Control: DNA or RNA for Library after Hybridization and Capture

Values are given for DNA as Acceptable | Target.
Final concentration for pooling: 0.5-4 nmol/l | 0.5-4 nmol/l.
Qubit, concentration: >0.1 ng/ul | >0.1 ng/ul.
Tapestation, concentration: >0.1 ng/ul | >0.1 ng/ul.
Tapestation, concentration: >0.5 nmol/l | >0.5 nmol/l.
Tapestation, average: 380-440 | 380-440.
LightCycler, concentration: >0.5 nmol/l | >0.5 nmol/l.
Comments: If the metrics are out of range, extraction process can be repeated. If an electropherogram with an additional peak at 150 bp is observed, the troubleshooting list (#1) as described in the present Example would provide guidance.

Quality Control Steps after DNA and RNA Library Preparation

The main quality control metrics for the sequencing process were obtained on the Illumina NextSeq® 500/550 sequencer. Table 7 shows the QC parameters of the sample run for whole-exome sequencing (WES) and RNA-sequencing.

TABLE 7 Quality Control: Sequencing Processes

Columns: Sample run QC parameter; Targeted values WES; Targeted values RNA-seq; Comments.
Cluster density, CLUSTER PF (%). WES: Targeted range 170-220, Acceptable range <280. RNA-seq: Targeted range 170-220, Acceptable range <280. Comments: Cluster density and Alignment rate should be monitored in every run.
Actual Yield. >15 Gbp. Comments: The yield that's been received.
% ≥ Q30. WES: Targeted range >85%, Acceptable range >75%. RNA-seq: Targeted range >85%, Acceptable range >75%. Comments: The percentage of bases with a quality score of 30 or higher. Quality scores and quality of signal/noise ratio should be monitored in every run.
ERROR RATE %. WES: Targeted <0.7%, Acceptable <1%. RNA-seq: Targeted <0.7%, Acceptable <1%. Comments: Refers to the percentage of bases called incorrectly at any one cycle. It is calculated from the reads that are aligned to Illumina's PhiX control. If this was not used then % Q30 would be the best tool to check base quality. Error rate increases along the length of the read.

Bioinformatics QC

After sequencing (e.g., RNA-seq), quality control can be performed for the bioinformatics pipeline. In brief, the software was created to measure the following parameters: single nucleotide variants (SNVs; somatic plus germline variants), small indels, copy number alterations (CNAs) (plus loss of heterozygosity (LOH)), focal amplifications/deletions, gene fusions and rearrangements (mRNA expressed), fusion protein expression, RNA expression (for biomarker proteins), and tumor mutational burden (TMB).

In any given experiment or project, one or more parameters could be used for quality control. For example, only SNV detection was performed for certain bioinformatic analysis. Only indel detection was performed for certain bioinformatic analysis. Only CNA detection was performed for certain bioinformatic analysis. Only fusion detection was performed for certain bioinformatic analysis. Only RNA expression measurement was performed for certain bioinformatic analysis. Only TMB measurement was performed for certain bioinformatic analysis. In other examples, SNV detection and CNA detection can be performed for certain bioinformatic analysis.

Table 8 provides a list of quality control parameters for bioinformatics analysis.

TABLE 8 Quality Control: Bioinformatics

Columns: Bioinformatics QC parameter; Targeted values WES; Targeted values RNA-seq; Comments.
Tumor purity. WES: >=20%. RNA-seq: >=20%. Comments: In case of failure, inform the physician if to proceed with sample below LOD.
Depth of coverage. WES: >=150x average coverage of tumor sample; >=100x average coverage of normal tissue. RNA-seq: >50 mln pair-end reads. Comments: In case of failure, resequence the sample.
Alignment rate. WES: >90%. RNA-seq: >90%.
Base call quality scores. WES: Phred score >30. RNA-seq: Phred score >30. Comments: Quality scores and quality of signal/noise ratio should be monitored in every run. Low-quality scores can lead to increased false-positive variant calls; thus, results must be interpreted with caution and repeat testing may be indicated.
Uniformity of coverage. WES: 85% of base pairs in target regions covered >=20x for tumor tissue; 85% of base pairs in target regions covered >=20x for normal tissue. RNA-seq: Not applicable.
GC bias. WES: Targeted value 50, Acceptable range: 45-65. RNA-seq: Targeted value 50, Acceptable range: 45-65. Comments: GC bias should be monitored with every run to detect changes in test performance or sample quality issues. In the last assembly of the reference human genome, the GC composition for the entire genome is ~40%, 48.9% for the RNA encoding. We use ExonV7 kit, the mean GC content for the targets there is 47.9%.
Mapping quality. WES: MapQ >=10. RNA-seq: Not applicable. Comments: The proportion of reads that do not map to target regions must be monitored during each run. Poor mapping quality may be a result of non-specific amplification, capture of off target DNA, or contamination. Reads with MapQ = 0 are obtained with simultaneous alignment to several regions.
Duplication rate. WES: <30%. RNA-seq: <85%. Comments: The duplication rate should be monitored in every run and for each sample independently to monitor library diversity.
Insert size. WES: Median insert size for tumor tissue ~[150; 200]; Median insert size for normal tissue ~[150; 200]. RNA-seq: Median insert size ~[150; 200]. Comments: Insert size is the length of the sequencing DNA (or RNA) that is “inserted” between the adapters.
Contamination. WES: <0.05%. RNA-seq: <0.05%. Comments: Contamination is the percentage of sequence segments of foreign origin in sample. A contaminated sequence is one that does not faithfully represent the genetic information from the biological source.
SNP concordance of a pair of samples tumor/normal from the same patient. WES: Targeted >90%, Acceptable >85%. RNA-seq: Targeted >90%, Acceptable >85%. Comments: In case of failure, investigate where the mix-up could have happened. Stop the bioinformatics analysis until the troubleshooting and sample concordance.
HLA allele concordance of a pair of samples tumor/normal from the same patient. WES: Threshold for normal vs tumor tissue <5. RNA-seq: Threshold for tumor RNAseq vs normal WES tissue <5. Comments: In case of failure, investigate where the mix-up could have happened. Stop the bioinformatics analysis until the troubleshooting and sample concordance.
ADA genomes contamination. WES: Targeted threshold >60, Acceptable threshold >40 (with WARNING sign). RNA-seq: Targeted threshold >40, Acceptable threshold >20 (with WARNING sign). Comments: For One hit one genome, GRCh38.d1.vd1. In case of failure, report to lab personnel.

The present Example provides troubleshooting protocols. For example, if the quality control of HLA allele concordance of a tumor/normal pair failed, it was an indicator that tissues from different patients were potentially mixed up. The lab personnel should proceed to confirm the potential mix-up and investigate the reason for it. The lab personnel should reach out to the physician if the mix-up was not due to internal errors in the laboratory.

Example 7: Determining the Sequence of Major Histocompatibility Complexes (MHC) can be Used to Assess Sequence Data Identity and/or Integrity

MHC genes are highly polymorphic, with large numbers of alleles for the genes of each class (e.g., Class I, II, and III) of MHC (e.g., human leukocyte antigens (HLAs) in humans). The combination of the number of potential alleles in a population with the number of genes in each individual results in a large number of unique MHC profiles. These can be used to assess the likelihood that sequence data is from a given source or subject (or if sequence data from multiple samples are from the same subject). Sequences corresponding to one or more MHC loci can be used to determine an MHC allele combination for a particular nucleic acid sample. The result of sequencing one or more MHC loci can be evaluated against asserted information (e.g., an asserted HLA allele combination) which is expected to be consistent with the sequence data. If the determined MHC combination matches the asserted information, the sequence data is consistent. If the determined MHC combination does not match the asserted information, the sequence data is inconsistent and this can indicate a problem with the sample and/or sequence data. For example, the sample and/or sequence data may have been contaminated, misidentified, degraded, or otherwise corrupted. This can prompt investigation into the origin of the inconsistency. Such investigation may entail determining the sequence of the sequence data at the MHC loci at least one additional time, obtaining a second sequence data set from the sample and determining the sequences of the sequence data at the MHC loci at least one additional time, reporting the sequence data as inconsistent, and/or a combination thereof. FIG. 11 illustrates an example of MHC data validation. In FIG. 11, six HLA alleles are determined from each of three sequence data sets (RNA-Seq data, WES tumor data, and WES normal data) from two subjects (e.g., 103 and 105). As can be seen from FIG. 11, all three samples share all six alleles for subject 105, indicating they are consistent and likely from the same subject. In the case of subject 103, however, there is consistency for two sequence data sets (between whole exome sequence data from a tumor sample (WES Tumor) and whole exome sequence data from a normal sample (WES Normal)), but an inconsistency for a third sequence data set (RNA sequence data allegedly from the same subject 103).
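
The pairwise comparison described for FIG. 11 can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the allele strings are made-up placeholder values rather than data from FIG. 11, and requiring all alleles to be shared is an assumption based on the description of subject 105 above.

```python
from collections import Counter
from itertools import combinations


def shared_alleles(a, b):
    """Count alleles common to two typings (duplicates are respected)."""
    return sum((Counter(a) & Counter(b)).values())


def inconsistent_pairs(datasets):
    """Return (name1, name2, shared, total) for every pair of data sets
    that does not share all of its alleles."""
    flagged = []
    for (n1, a1), (n2, a2) in combinations(datasets.items(), 2):
        total = max(len(a1), len(a2))
        shared = shared_alleles(a1, a2)
        if shared < total:
            flagged.append((n1, n2, shared, total))
    return flagged


if __name__ == "__main__":
    # Placeholder allele calls for illustration only.
    wes = ["A*01:01", "A*02:01", "B*07:02", "B*08:01", "C*07:01", "C*07:02"]
    rna = ["A*03:01", "A*24:02", "B*35:01", "B*44:02", "C*04:01", "C*05:01"]
    subject = {"RNA-Seq": rna, "WES Tumor": wes, "WES Normal": list(wes)}
    for n1, n2, shared, total in inconsistent_pairs(subject):
        print(f"{n1} vs {n2}: {shared}/{total} alleles shared")
```

With these placeholder values, the RNA-Seq data set is flagged against both WES data sets, while the WES Tumor and WES Normal pair is not flagged, mirroring the pattern described for subject 103.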

In some embodiments, the sequence of at least one MHC locus is determined and verified against at least one MHC sequence of the same locus from a reference sequence data set (e.g., from a sample asserted to be from the same subject). In some embodiments, two or more MHC loci are sequenced (e.g., at least three, four, or five MHC loci are sequenced). In some embodiments, six MHC loci are sequenced. In some embodiments, more than six MHC loci are sequenced.

In some embodiments, the sequence data is from a human subject. In some embodiments, the MHC is human leukocyte antigens (HLA). Accordingly, in some embodiments two or more HLA loci are sequenced (e.g., three, four, five, or more HLA loci). In some embodiments, six HLA loci are sequenced. In some embodiments, more than six HLA loci are sequenced.

In some embodiments, the results are displayed to a user in a report (e.g., via a GUI).

Example 8: Predicted Tumor Type can be Used to Assess Sequence Data Identity and/or Integrity

Various techniques can be used to predict, based on nucleic acid sequence data, the type of tumor from which a sample was taken (e.g., breast, colon, prostate, bladder, kidney, rectal, lung, lymphoma, melanoma, oral, oropharyngeal, pancreatic, thyroid, uterine, eye, gastrointestinal, etc.). Many existing tools rely on large data sets of known samples with evaluated biomarkers, thereby allowing biomarkers from a sequence data set to be compared against an existing known data set. Other methods of prediction utilize neural networks and deep learning systems to analyze data sets and perform the data analysis. The sequence data can be evaluated against an existing network or data set to predict the type of tumor from which the sequence data was obtained.

The result of a tumor type prediction (e.g., determined information) can then be evaluated against asserted information (e.g., a tumor type) which is believed to be consistent with the sequence data. If the determined information matches the asserted information, the sequence data is consistent and believed to be identified correctly. If the determined information matches the asserted information, the sequence data can be processed to determine whether the sequence data is indicative of one or more disease features. If the determined information does not match the asserted information, the sequence data is inconsistent and may indicate a problem with the sample or sequence data. For example, the sample and/or sequence data may have been contaminated, misidentified, degraded, or otherwise corrupted. This can prompt investigation into the origin of the inconsistency. Such investigation may entail predicting the tumor type from the sequence data at least one additional time, obtaining a second sequence data set from the sample and performing the prediction at least one additional time, reporting the sequence data as inconsistent, and/or a combination thereof.
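
A minimal sketch of the comparison between determined and asserted tumor type is given below. It assumes that some upstream predictor, not shown and not specified here, has already produced a score per tumor type; the function name, the score dictionary, and the case-insensitive matching are all assumptions made for illustration.

```python
def evaluate_tumor_type(class_scores, asserted_type):
    """Compare the top-scoring predicted tumor type with the asserted type.

    `class_scores` maps tumor type to a score from a hypothetical predictor.
    """
    if not class_scores:
        raise ValueError("class_scores must not be empty")
    predicted = max(class_scores, key=class_scores.get)
    consistent = predicted.strip().lower() == asserted_type.strip().lower()
    return {
        "predicted": predicted,
        "asserted": asserted_type,
        "status": "consistent" if consistent else "inconsistent",
    }


if __name__ == "__main__":
    scores = {"breast": 0.81, "lung": 0.12, "colon": 0.07}  # made-up scores
    print(evaluate_tumor_type(scores, "Breast"))
    print(evaluate_tumor_type(scores, "lung"))
```

An "inconsistent" status would then prompt the follow-up steps described above, such as repeating the prediction or reporting the inconsistency.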

As can be seen in FIG. 12, a tumor type can be predicted (e.g., BRCA associated breast cancer) and evaluated in the context of an asserted tumor type or in the context of at least one additional sequence data set (e.g., reference sequence data, or sequence data from the same subject and/or the same tumor sample). If the asserted information matches the determined information, the data is consistent. If they do not match, it signals a possible inconsistency which may be evaluated and/or reported to the user. Further, when the determined value is evaluated in the context of additional sequence data, it can be used to evaluate whether the sequence data are from the same subject or source, or from different sources.

Accordingly, in some embodiments a predicted tumor type is determined from the sequence information and evaluated against an asserted tumor type. In some embodiments, the results are displayed to a user in a report (e.g., via a GUI).
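The comparison of determined and asserted tumor type described in this example can be sketched as follows. This is a minimal illustration only: the function name `check_tumor_type`, the label normalization, and the returned status strings are hypothetical, and the prediction itself (e.g., by a trained classifier) is assumed to have been made elsewhere.

```python
def check_tumor_type(predicted: str, asserted: str) -> str:
    """Compare a predicted tumor type against the asserted tumor type.

    Labels are compared case-insensitively after trimming whitespace
    (an assumption made for this sketch).
    """
    def normalize(label: str) -> str:
        return label.strip().lower()

    if normalize(predicted) == normalize(asserted):
        # Determined information matches asserted information.
        return "consistent"
    # A mismatch may indicate contamination, misidentification,
    # degradation, or other corruption and should prompt investigation.
    return "inconsistent"
```

An "inconsistent" result would typically trigger the follow-up steps described above, such as repeating the prediction or obtaining a second sequence data set.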

Example 9: Ratio of Protein Subunits can be Used to Assess Sequence Data Identity and/or Quality

Multi-subunit proteins encoded by the nucleic acid can be used to evaluate the sequence data. The expression levels of different subunits of a protein can be evaluated by determining the expression of each subunit (e.g., by determining DNA or RNA levels encoding each subunit) and determining a ratio of the subunits (e.g., by determining a ratio of DNA or RNA levels encoding different protein subunits in a nucleic acid sample). This ratio (determined information) can then be validated against either asserted information (e.g., an expected ratio) or additional sequence data. If the ratio matches an expected ratio (e.g., a ratio either believed to be accurate based on other sequence data obtained from the subject, or a known ratio for the protein and its constituent subunits), the sequence information can be validated. If the determined ratio does not match the expected ratio, the sequence data is inconsistent, which may indicate a problem with the sample or sequence data. For example, the sample and/or sequence data may have been contaminated, misidentified, degraded, or otherwise corrupted. This can prompt an investigation into the origin of the inconsistency. Such investigation may entail determining a new ratio from the sequence data at least one additional time, obtaining a second sequence data set from the sample and determining the ratio at least one additional time, reporting the sequence data as inconsistent, and/or a combination thereof.

FIG. 13A shows a graph representing expression levels of subunits which agree with a predicted or known value for the subunits being evaluated or are within an acceptable or determined threshold for such ratio. FIG. 13B shows a graph representing expression levels of subunits which disagree with the predicted or known value for the subunits being evaluated or are outside an acceptable or determined threshold for such ratio.

As can be seen in FIG. 13A, nucleic acids encoding protein subunits can be evaluated against a known ratio (e.g., an existing measured value, or a theoretical value based on sequences known in the art) or can be evaluated against measured data from known samples (e.g., fit to a line as shown). When the ratio falls within accepted or established thresholds for variability and deviation, it is identified as consistent. As can be seen in FIG. 13B, nucleic acids encoding protein subunits can be evaluated against a known ratio (e.g., an existing measured value, or a theoretical value based on sequences known in the art), or can be evaluated against measured data from known samples (e.g., fit against a line as shown). When the ratio falls outside accepted or established thresholds for variability and deviation, it can be identified as inconsistent. In some embodiments, at least one ratio is determined. In some embodiments, nucleic acids encoding a second protein and/or its subunits are evaluated to determine a second ratio. In some embodiments, nucleic acids encoding a third protein and/or its subunits are evaluated to determine a third ratio. In some embodiments, nucleic acids encoding a fourth protein and/or its subunits are evaluated to determine a fourth ratio. In some embodiments, nucleic acids encoding at least one additional protein and/or its subunits are used to determine at least one additional ratio.

In some embodiments, the subunits used to determine the ratio are CD3 subunits CD3D and CD3G. In some embodiments, the subunits used to determine the ratio are CD3 subunits CD3E and CD3D. In some embodiments, the subunits used to determine the ratio are CD3 subunits CD3G and CD3E. In some embodiments, the subunits used to determine the ratio are CD8 subunits CD8B and CD8A. In some embodiments, the subunits used to determine the ratio are the CD79 subunits CD79A and CD79B.

In some embodiments, the results are displayed to a user in a report (e.g., via a GUI).
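The subunit ratio check of this example can be sketched as follows, under stated assumptions. The expected ratio and the tolerance are hypothetical parameters (no specific values are given above), and measuring the deviation on a log2 scale is an illustrative choice rather than a method taken from this disclosure.

```python
import math


def subunit_ratio_consistent(expr_a: float, expr_b: float,
                             expected_ratio: float = 1.0,
                             tolerance_log2: float = 1.0) -> bool:
    """Check whether the expression ratio of two subunits (e.g., CD3D and
    CD3G) is within a tolerance of an expected ratio.

    expr_a, expr_b: expression levels of the two subunits (e.g., in TPM).
    expected_ratio, tolerance_log2: hypothetical values for illustration.
    """
    if expr_a <= 0 or expr_b <= 0:
        # The ratio is undefined; treated here as inconsistent.
        return False
    observed_ratio = expr_a / expr_b
    deviation = abs(math.log2(observed_ratio / expected_ratio))
    return deviation <= tolerance_log2
```

The same function could be applied to each subunit pair named above (e.g., CD8B and CD8A, or CD79A and CD79B) to obtain a second or further ratio.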

Example 10: Polyadenylation Status can be Used to Assess Sequence Data Identity and/or Integrity

PolyA status can be used to evaluate the sequence data. The sequence data can be evaluated to determine whether different genes are polyadenylated and/or are present or not (e.g., histone genes, mitochondrial genes). This analysis can be used to evaluate and/or assess the likelihood that an asserted sample preparation protocol is correct (e.g., to validate whether an RNA sample is a polyA or a total RNA sample). If the determined polyA status matches the asserted polyA status, the sequence data is validated as consistent. If the determined polyA status does not match the asserted polyA status, the sequence data is identified as inconsistent, which may indicate a problem with the sample or sequence data. Additionally, in the instance where ambiguous results are returned for the polyA status (e.g., where some polyadenylated genes are found but others are not, where unanticipated expression is found, or where less than expected expression is found (e.g., partial expression)), it may indicate problems with the sample preparation, degradation of the sample from which the sequence data was prepared, or other quality issues. For example, the sample and/or sequence data may have been contaminated, misidentified, degraded, or otherwise corrupted. This can prompt an investigation into the origin of the inconsistency. Such investigation may entail determining a polyA status from the sequence data at least one additional time, obtaining a second sequence data set from the sample and determining the polyA status at least one additional time, reporting the sequence data as inconsistent, and/or a combination thereof.

FIGS. 14A-14B show examples of bar graphs representing the probability that sequence information was obtained from samples that contained only polyadenylated RNA or from samples that contained total or all RNA (total RNA). FIG. 14A shows positive results (indicating sequences which appear uniform) from the analysis of two different sequences. The left set of bars (bars 1-20, as read left to right) shows results from a sequence which has a high probability of being from samples which contained primarily polyadenylated RNA. The right set of bars (bars 21-40, as read left to right) shows results from a sequence which has a high probability of being from samples which contained primarily total RNA. FIG. 14B shows poor results (indicating, for example, possible contamination or degradation) from the analysis of two different sequences. The outlined box, tagged “Bad,” shows a probability of about 50% that the sequences are from polyadenylated RNA, indicating that it is indeterminate whether the sequences are uniform.

As can be seen in FIG. 14A, the sequence data can be evaluated and a polyA status can be determined as either polyadenylated or total RNA in some embodiments. FIG. 14B shows an example where the determination falls below a threshold of either polyadenylated or total RNA (e.g., 50% polyA, 50% total RNA). In this case, the sequence data can be identified as inconsistent and/or of poor quality and may signal a problem with the nucleic acid sample. Accordingly, in some embodiments the sequence data is identified as polyadenylated sequence data. In some embodiments, the sequence data is identified as total RNA sequence data. In some embodiments, the threshold for identifying a sample as polyA is when the percent polyA RNA in a sample is above 50%. In some embodiments, the threshold is 60%. In some embodiments, the threshold is 70%. In some embodiments, the threshold is 80%. In some embodiments, the threshold is 90%. In some embodiments, the threshold is 95%. In some embodiments, the threshold is 96%. In some embodiments, the threshold is 97%. In some embodiments, the threshold is 98%. In some embodiments, the threshold is 99%. In some embodiments, the results are displayed to a user in a report (e.g., via a GUI).
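The threshold-based polyA determination described above can be sketched as follows. The sketch assumes that a probability of the sample being polyA-enriched has already been estimated elsewhere. The function names and the default threshold of 90% (one of the values listed above) are illustrative choices, and applying the same threshold symmetrically to call a total RNA sample is an assumption of this sketch.

```python
def classify_polya_status(p_polya: float, threshold: float = 0.9) -> str:
    """Classify a sample as 'polyA', 'total RNA', or 'indeterminate'.

    p_polya: estimated probability (0 to 1) that the sequence data came
             from a polyA-enriched sample.
    threshold: probability that must be exceeded to make a call.
    """
    if not 0.0 <= p_polya <= 1.0:
        raise ValueError("p_polya must be between 0 and 1")
    if p_polya > threshold:
        return "polyA"
    if (1.0 - p_polya) > threshold:
        return "total RNA"
    # Neither call clears the threshold (e.g., about 50% / 50%).
    return "indeterminate"


def polya_status_consistent(p_polya: float, asserted: str,
                            threshold: float = 0.9) -> bool:
    """Return True if the determined status matches the asserted status."""
    return classify_polya_status(p_polya, threshold) == asserted
```

An "indeterminate" result corresponds to the situation shown in FIG. 14B and would be reported as inconsistent and/or of poor quality.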

Example 11: Exon Coverage can be Used to Assess Sequence Data Identity and/or Integrity

Various techniques can be used for evaluating the consistency of data and/or to group data points for analysis (e.g., in some embodiments principal component analysis (PCA) can be used). Such techniques can be useful in evaluating sequence data for identity and/or integrity. For instance, exon coverage can be determined from the sequence information and evaluated to determine whether there is a consistent level of coverage when compared to other sequence information or to an asserted (e.g., expected) coverage level. An inconsistency in the coverage (e.g., higher or lower coverage than expected) could indicate that the sequence data is from a different source than expected (e.g., than asserted), or that there is a problem with the sequence data or the sample from which it was obtained.

Exon coverage can be determined for different batches of sequence data reads from a given subject and plotted against sequence data from other subjects.

In some embodiments, the results of the evaluation are presented to a user in a report (e.g., via a GUI).

Example 12: RNAseq Read Distribution and Composition can be Used to Assess Sequence Data Identity and/or Integrity

In some embodiments, read composition can be evaluated in the context of the number of reads of a given component (e.g., protein coding sequence) of the sequence data, in terms of the total number of reads for that component and/or the relative percentage of that component calculated against the total number of reads. These can be compared against a threshold established for each parameter (e.g., total number of reads, and/or reads of a component relative to the total number of reads).

In some embodiments, a threshold is 20 million total reads per protein coding region. In some embodiments, a threshold for the relative number of reads of a protein coding region compared to the total number of reads in a sample is 50% or more. In some embodiments, the results are displayed to a user in a report (e.g., via a GUI).
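The read composition thresholds above can be sketched as a simple check. The default values are the thresholds stated in this example. Requiring both conditions at once is an assumption of this sketch (the text allows either or both), and the function and parameter names are hypothetical.

```python
def read_composition_ok(coding_reads: int, total_reads: int,
                        min_coding_reads: int = 20_000_000,
                        min_coding_fraction: float = 0.5) -> bool:
    """Check protein-coding read count and fraction against thresholds.

    coding_reads: number of reads assigned to protein coding sequence.
    total_reads:  total number of reads in the sample.
    """
    if total_reads <= 0:
        return False
    coding_fraction = coding_reads / total_reads
    # Both the absolute and the relative threshold must be met here.
    return (coding_reads >= min_coding_reads
            and coding_fraction >= min_coding_fraction)
```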

Example 13: Biomarkers can be Used to Assess Sequence Identity and/or Integrity

Biomarkers can also be assessed to evaluate the quality and/or identity of the sequence data. As shown in FIG. 15, PCA can be performed to evaluate the expression of biomarkers. The results can be compared or trained against existing data sets of similar cohorts. The evaluation can be used to help validate asserted information and/or one or more additional sequence data sets. In some embodiments, this will be useful to determine with increased likelihood that the sequence information is from a given source or subject. In contrast, inconsistency (e.g., if the evaluation does not match the asserted information and/or one or more additional sequence data sets) may indicate that there is a potential quality issue related to the data that should be further investigated to identify the source of the inconsistency (and/or that the data should not be used for further analysis).

In some embodiments, the biomarker is for follicular lymphoma. In some embodiments, the results are displayed to a user in a report (e.g., via a GUI).

Example 14: Non-Limiting Examples of Quality Control Metrics for Assessing Sequence Data Identity and/or Integrity

In some embodiments, the disclosure relates to a method wherein at least one of the following additional features is determined: (1) mean quality score; (2) contamination value; (3) GC content; (4) duplication level; (5) gene body coverage; and (6) per chromosome coverage. One or more of these determinations can be used to further assess the source or integrity of the sequence data by comparison with a reference, or by comparison with at least one additional sequence data set.

In some embodiments, at least one additional feature is determined. In some embodiments, at least two additional features are determined. In some embodiments, at least three additional features are determined. In some embodiments, at least four additional features are determined. In some embodiments, at least five additional features are determined. In some embodiments, at least six additional features are determined.

In some embodiments, evaluation of a concordance value of single nucleotide polymorphisms (SNPs) comprises: (a) determining a concordance value of single nucleotide polymorphisms (SNPs) from the sequence data; and (b) determining whether the concordance value of the sequence data matches or exceeds a reference concordance value. In some embodiments, the reference concordance value is 80%.

In some embodiments, evaluation of a contamination value comprises: (a) determining a contamination value of the sequence data; and (b) determining whether the contamination value is less than a reference contamination value. In some embodiments, the reference contamination value is 10%.

In some embodiments, evaluation of a complexity value comprises: (a) determining a complexity value of the sequence data; and (b) determining whether the complexity value matches a reference complexity value.

In some embodiments, evaluation of a Phred Score comprises: (a) determining a Phred Score of the sequence data; and (b) determining whether the Phred Score matches or exceeds a reference Phred Score.

In some embodiments, evaluation of a GC content comprises: (a) determining a GC content of the sequence data; and (b) determining if the GC content matches a reference GC content.

In some embodiments, the methods further comprise generating a report to display the results of the at least one additional determination to a user (e.g., via a GUI).
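Several of the determinations in this example reduce to computing a value and comparing it with a reference. The sketch below illustrates this for GC content, SNP concordance, and contamination. The 80% and 10% defaults are the reference values stated above. The reference GC content and the tolerance used to decide whether GC content "matches" are not specified in this example and must be supplied by the caller. All function and parameter names are hypothetical.

```python
def gc_content(reads):
    """Fraction of G/C among unambiguous bases (A, C, G, T) across reads."""
    gc = 0
    total = 0
    for read in reads:
        seq = read.upper()
        gc += seq.count("G") + seq.count("C")
        total += sum(seq.count(base) for base in "ACGT")
    return gc / total if total else 0.0


def evaluate_reference_checks(concordance, contamination, gc,
                              ref_gc, gc_tolerance,
                              ref_concordance=0.80,
                              ref_contamination=0.10):
    """Compare determined values with reference values.

    Returns a dict mapping each metric name to True (passes) or False.
    """
    return {
        # Concordance must match or exceed the reference value.
        "snp_concordance": concordance >= ref_concordance,
        # Contamination must be less than the reference value.
        "contamination": contamination < ref_contamination,
        # GC content "matches" if within a caller-supplied tolerance.
        "gc_content": abs(gc - ref_gc) <= gc_tolerance,
    }
```

The resulting dictionary could be used to populate the report displayed to the user.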

Example 15: Non-Limiting Protocol for Assessing Sequencing Data Quality Control

In some embodiments, a quality protocol for sequence data (e.g., for WES and/or RNAseq data) comprises one or more of the following steps:

-   i) In some embodiments, low-quality reads (for example, based on positional information) are removed. In some embodiments, low-quality sequences (e.g., reads from low-quality areas of a sequencing flow cell) are removed from sequence data (e.g., from a FASTQ file). In some embodiments, if a significant fraction of the sequence reads are of low-quality (e.g., if bad tiles represent more than 30%, more than 40%, or more than 50% of a sequence data file).
-   ii) In some embodiments, a quality control tool for sequence data (e.g., FastQC) is used to evaluate one or more of library complexity (e.g., read counts); quality of the sequencing platform (e.g., based on a per base Phred quality score); per tile quality score; per sequence GC content (for example to detect contamination based on an unexpected GC content); per base sequence content (e.g., to detect adapter or other contamination); sequence duplication levels (e.g., to evaluate the quality of RNA/DNA selection and/or PCR amplification); and/or adapter content. In some embodiments, a quality threshold for further analysis includes greater than 10 million read counts (e.g., greater than 20 million read counts), and/or a Phred score of greater than 25 (e.g., 28 or greater than 28) in more than 30% of reads (e.g., in 50% of reads or more than 50% of reads). In some embodiments, a quality control pipeline is stopped if the quality threshold is not met.
-   iii) In some embodiments, sequence data is screened against a library of sequences (e.g., using FastQ Screen, from Babraham Bioinformatics), for example to detect cross-species contamination (e.g., from other sources such as mouse, zebrafish, Drosophila, C. elegans, Saccharomyces, Arabidopsis, microbiome, adapters, vectors, phiX, or other source). In some embodiments, a quality control threshold for further analysis based on cross-species contamination is set at around 10%, around 20%, around 30%, or higher. For example, in some embodiments, a quality control pipeline is stopped if sequence data comprises 30% or greater than 30% contamination (e.g., with bacterial sequence).
-   iv) In some embodiments, per chromosome coverage distribution and/or coverage distribution is determined for one or more specific regions (e.g., one or more CCDS protein coding regions, exons, etc.) using an analytical tool (e.g., Mosdepth). In some embodiments, a quality control threshold for further analysis involves confirming that sequence data covers clinically important genomic regions. In some embodiments, a quality control pipeline is stopped if sequence coverage does not include one or more target genomic regions of interest.
-   v) In some embodiments, an analytical tool (e.g., Picard) is used to evaluate one or more sequence data parameters such as insert size, duplicates, mapping, pairing, or other parameter(s).
-   vi) In some embodiments, an analytical tool for evaluating RNA sequence data (e.g., RSeQC) is used, for example, to determine insert size (e.g., inner distance between paired RNA reads), strandedness (e.g., to determine or confirm whether a stranded or non-stranded RNA sequence protocol was used), and/or gene body coverage (e.g., to determine coverage bias, for example associated with an RNA extraction protocol, for example to distinguish polyA versus total RNA sequence data).
-   vii) In some embodiments, a quality threshold for RNA analysis comprises determining the percentage duplicates and/or adapter contamination and proceeding with further analysis for RNA sequence data that has less than 70% (for example less than 60%, or less than 50%) duplicates and/or less than 25% (for example less than 20%, less than 15%, or less than 10%) adapter contamination. Accordingly, an analysis protocol is terminated, in some embodiments, when RNA sequence data has more than 50% (e.g., 60% or more than 60%, or more than 70%) duplicates and/or more than 10% (e.g., more than 15%, 20% or more than 20%, or more than 30%) adapter contamination.
-   viii) In some embodiments, cross-individual contamination is evaluated (e.g., using a concordance and/or contamination estimator such as Conpair), for example to determine the concordance of a pair of samples (e.g., tumor and normal) obtained from the same patient. In some embodiments, further analysis is performed if the normal and tumor samples (e.g., normal and tumor DNA) are identified as being from the same subject.
-   ix) In some embodiments, a tumor-type classifier is used to predict a tumor type from gene expression data of a sample, and the predicted tumor type is compared to the asserted tumor type (e.g., the tumor type provided along with the nucleic acid data). In some embodiments, further analysis is performed if the predicted and asserted tumor types match.
-   x) In some embodiments, an RNA sequence type classifier is used to predict the library type from RNA sequence data (e.g., based on specific gene expression levels or patterns). In some embodiments, further analysis is performed if the predicted library type matches an asserted library type for the sample being analyzed.
-   xi) In some embodiments, an MHC allele composition is determined for two or more samples (e.g., from tumor and/or normal tissue) from the same subject. In some embodiments, further analysis is performed if the MHC allele compositions for the two or more samples match.

In some embodiments, one or more of the steps described above are performed. If sequence data (e.g., RNA and/or DNA sequence data) fails to satisfy one or more of these quality control steps, the sequence data can be excluded from further analysis. In some embodiments, additional sequence data can be obtained for a subject for which an initial set of sequence data did not satisfy one or more quality control criteria.
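The stop-on-failure behavior of such a quality control pipeline can be sketched as follows. The sketch assumes that the metrics have already been produced by upstream tools (e.g., FastQC, FastQ Screen, Picard) and collected into a single record. The `QCMetrics` record, its field names, and the particular subset of steps are hypothetical. The thresholds are taken from among the values listed in the steps above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QCMetrics:
    read_count: int                     # total read count
    frac_reads_phred_gt_25: float       # fraction of reads with Phred > 25
    cross_species_contamination: float  # fraction of contaminating reads
    duplicate_fraction: float           # fraction of duplicate reads
    adapter_fraction: float             # fraction of adapter contamination
    tumor_type_match: bool              # predicted == asserted tumor type
    library_type_match: bool            # predicted == asserted library type


def first_failed_step(m: QCMetrics) -> Optional[str]:
    """Return the name of the first failed QC step, or None if all pass.

    Steps are evaluated in order and evaluation stops at the first
    failure, mirroring a pipeline that is stopped when a threshold is
    not met.
    """
    checks = [
        ("read count", m.read_count > 10_000_000),
        ("Phred score", m.frac_reads_phred_gt_25 > 0.30),
        ("cross-species contamination", m.cross_species_contamination < 0.30),
        ("duplicates", m.duplicate_fraction < 0.70),
        ("adapter contamination", m.adapter_fraction < 0.25),
        ("tumor type", m.tumor_type_match),
        ("library type", m.library_type_match),
    ]
    for name, passed in checks:
        if not passed:
            return name
    return None
```

Sequence data for which `first_failed_step` returns a step name would be excluded from further analysis, and additional sequence data could be requested for that subject.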

Example Embodiments

Some embodiments provide for a method comprising: obtaining a first sample of a first tumor from a subject having, suspected of having, or at risk of having cancer; extracting RNA from the first sample of the first tumor; enriching the RNA for coding RNA to obtain enriched RNA; preparing a first library of DNA fragments from the enriched RNA for non-stranded RNA sequencing; and performing non-stranded RNA sequencing on the first library of DNA fragments prepared from the enriched RNA.

In some embodiments, the method further comprises extracting DNA from the first sample of the tumor; preparing a second library of DNA fragments from the extracted DNA; and performing whole exome sequencing (WES) on the second library of DNA fragments.

In some embodiments, the method further comprises: obtaining a first sample of blood from the subject; extracting DNA from the first sample of blood; preparing a third library of DNA fragments from the DNA extracted from the first sample of blood; and performing whole exome sequencing (WES) on the third library of DNA.

In some embodiments, the method further comprises obtaining a second sample of a second tumor from the subject. In some embodiments, the first tumor and the second tumor are the same tumor. In some embodiments, the first and second tumors are different tumors.

In some embodiments, the method further comprises combining the first and second tumor samples to form a combined tumor sample, and extracting the RNA comprises extracting the RNA from the combined tumor sample.

In some embodiments, the method further comprises: extracting RNA from the second sample; and combining the RNA extracted from the second sample with the RNA extracted from the first sample to form combined extracted RNA, and wherein enriching the RNA for coding RNA comprises enriching the combined extracted RNA for coding RNA. In some embodiments, the method further comprises: extracting DNA from the second tumor sample; and combining the DNA extracted from the second tumor sample with the DNA extracted from the first tumor sample to form combined extracted DNA, and preparing a second library of DNA fragments from the extracted DNA comprises preparing a library of DNA fragments from the combined extracted DNA.

In some embodiments, the method further comprises placing the first sample in a first cryogenic tube, the first cryogenic tube comprising a composition that is able to penetrate the sample and protect DNA and/or RNA therein from degradation. In some embodiments, the method further comprises snap freezing the contents of the first cryogenic tube.

In some embodiments, the method further comprises placing the first sample of blood in a vacutainer comprising an anticoagulant. In some embodiments, the method further comprises snap freezing the contents of the vacutainer. In some embodiments, the snap-frozen contents of the cryogenic tube and/or vacutainer are stored for up to 7 months at −65° C. to −80° C.

In some embodiments, the first tumor sample is at least 20 mg in weight, consists of at least 2×10⁶ cells, or provides at least 1 μg of RNA upon RNA extraction.

In some embodiments, the method further comprises: forming a single-cell suspension of cells from the first sample of tumor; and performing mass cytometry on at least a first part of the single-cell suspension, the at least the first part of the single-cell suspension comprising at least 5×10⁶ cells.

In some embodiments, the method further comprises forming a lysate from at least a second part of the single-cell suspension, the at least the second part of the single-cell suspension comprising at least 2×10⁶ cells; extracting RNA from the lysate; performing RNA sequencing on the extracted RNA to obtain RNA expression data; and/or determining whether the first tumor is heterogeneous based on the RNA expression data.

In some embodiments, forming the single-cell suspension of cells comprises: dissecting the first tumor sample to obtain tumor sample fragments; incubating the tumor sample fragments in an enzyme cocktail, the enzyme cocktail comprising penicillin and/or streptomycin, collagenase I, and collagenase IV; and filtering the enzyme cocktail through a 70 μm cell strainer.

In some embodiments, the first sample of blood is at least 0.5-1.0 ml in volume.

In some embodiments, the RNA extracted from either the first sample or the second sample is at least 1000-6000 ng in total mass and has a purity corresponding to a ratio of absorbance at 260 nm to absorbance at 280 nm of at least 2.0.

In some embodiments, the DNA extracted from the first sample is at least 1000-2000 ng in total mass, in at least 10 μl of solution, of a concentration of 100-200 ng/μl, and has a purity corresponding to a ratio of absorbance at 260 nm to absorbance at 280 nm of at least 1.8.

In some embodiments, enriching the RNA for coding RNA comprises performing polyA enrichment.

In some embodiments, the WES performed on the second library of DNA fragments, and the WES performed on the third library of DNA fragments have at least 100 bp paired-end reads, and an estimated coverage of at least 100×.

In some embodiments, the WES has an estimated coverage of at least 150×.

In some embodiments, the RNA sequencing on the first library of DNA fragments has at least 100 bp paired-end reads, and an estimated total number of reads of at least 50 million paired-end reads.

In some embodiments, the RNA sequencing on the first library of DNA fragments has at least 100 bp paired-end reads, and an estimated total number of reads of at least 100 million paired-end reads.

In some embodiments, the method further comprises subjecting a sample of any one of the prepared libraries of DNA fragments to quality control tests to evaluate their integrity and/or peak size, wherein each sample of the prepared libraries comprises up to 1 ng of a library.

In some embodiments, the subject is human.

Some embodiments provide for a kit, comprising: a composition that is able to penetrate tissue and protect DNA and/or RNA therein from degradation; at least one tool for dissecting a sample of tumor and preparing a single-cell suspension therefrom; at least one reagent for snap-freezing of biological samples; an anticoagulant; at least one vacutainer; at least one reagent for extracting DNA and RNA from tissue samples and blood; and at least one reagent for preparing DNA libraries from DNA and/or RNA samples.

Some embodiments provide for a kit for use in a method according to any of the preceding examples.

Some embodiments provide for a system comprising at least one computer hardware processor and at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for processing RNA expression data. The method comprises using at least one computer hardware processor to perform: obtaining RNA expression data for a subject having, suspected of having, or at risk of having cancer; aligning and annotating genes in the RNA expression data with known sequences of the human genome to obtain annotated RNA expression data; removing non-coding transcripts from the annotated RNA expression data; converting the annotated RNA expression data to gene expression data in transcripts per kilobase million (TPM); identifying at least one gene that introduces bias in the gene expression data; removing the at least one gene from the gene expression data to obtain bias-corrected gene expression data; and identifying a cancer treatment for the subject using the bias-corrected gene expression data.

Some embodiments provide for at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform a method for processing RNA expression data. The method comprises using at least one computer hardware processor to perform: obtaining RNA expression data for a subject having, suspected of having, or at risk of having cancer; aligning and annotating genes in the RNA expression data with known sequences of the human genome to obtain annotated RNA expression data; removing non-coding transcripts from the annotated RNA expression data; converting the annotated RNA expression data to gene expression data in transcripts per kilobase million (TPM); identifying at least one gene that introduces bias in the gene expression data; removing the at least one gene from the gene expression data to obtain bias-corrected gene expression data; and identifying a cancer treatment for the subject using the bias-corrected gene expression data.

Some embodiments provide for a method for processing RNA expression data, the method comprising using at least one computer hardware processor to perform: obtaining RNA expression data for a subject having, suspected of having, or at risk of having cancer; aligning and annotating genes in the RNA expression data with known sequences of the human genome to obtain annotated RNA expression data; removing non-coding transcripts from the annotated RNA expression data; converting the annotated RNA expression data to gene expression data in transcripts per kilobase million (TPM); identifying at least one gene that introduces bias in the gene expression data; removing the at least one gene from the gene expression data to obtain bias-corrected gene expression data; and identifying a cancer treatment for the subject using the bias-corrected gene expression data.
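The conversion to TPM recited above can be illustrated with the commonly used definition of TPM: each read count is divided by the feature length in kilobases, and the resulting rates are scaled so that they sum to one million. This is a minimal sketch of that general definition and is not asserted to be the specific conversion used in any embodiment. The function name and the use of plain dictionaries keyed by gene are hypothetical.

```python
def counts_to_tpm(counts: dict, lengths_bp: dict) -> dict:
    """Convert read counts to TPM.

    counts:     mapping of gene -> read count.
    lengths_bp: mapping of gene -> (effective) length in base pairs.
    """
    # Length-normalized rate: reads per kilobase.
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1000.0)
             for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    # Scale so that all values sum to one million.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}
```

Because TPM values always sum to one million within a sample, removing genes from the data changes the relative share of every remaining gene, which is one reason the gene removal step matters for comparisons between samples.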

In some embodiments, identifying the at least one gene from the gene expression data comprises identifying at least one gene having an average transcript length at least a threshold amount higher or lower than an average length of transcripts in the gene expression data. In some embodiments, identifying the at least one gene from the gene expression data comprises identifying at least one gene having at least a threshold variation in average transcript expression level based on transcript expression levels in reference samples.
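The transcript-length criterion above can be sketched as follows. The sketch assumes that an average transcript length per gene is already available. It takes the "average length of transcripts in the gene expression data" to be the unweighted mean of those per-gene averages, which is an assumption. The function name and the threshold value supplied by the caller are hypothetical.

```python
from statistics import mean


def genes_with_outlier_length(avg_len_by_gene: dict, threshold_bp: float) -> list:
    """Return genes whose average transcript length differs from the
    overall average by at least threshold_bp (higher or lower).

    avg_len_by_gene: mapping of gene -> average transcript length (bp).
    """
    overall = mean(avg_len_by_gene.values())
    return sorted(gene for gene, length in avg_len_by_gene.items()
                  if abs(length - overall) >= threshold_bp)
```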

In some embodiments, identifying the at least one gene from gene expression data comprises identifying one or more genes having a polyA tail that is at least a threshold amount smaller in length compared to an average length of polyA tails of genes from a sample from which the RNA expression data was obtained.

In some embodiments, the at least one gene belongs to a family of genes selected from the group consisting of: histone-encoding genes, mitochondrial genes, interleukin-encoding genes, collagen-encoding genes, B-cell receptor-encoding genes, and T cell receptor-encoding genes.

In some embodiments, the at least one gene comprises at least one histone-encoding gene selected from the group consisting of: HIST1H1A, HIST1H1B, HIST1H1C, HIST1H1D, HIST1H1E, HIST1H1T, HIST1H2AA, HIST1H2AB, HIST1H2AC, HIST1H2AD, HIST1H2AE, HIST1H2AG, HIST1H2AH, HIST1H2AI, HIST1H2AJ, HIST1H2AK, HIST1H2AL, HIST1H2AM, HIST1H2BA, HIST1H2BB, HIST1H2BC, HIST1H2BD, HIST1H2BE, HIST1H2BF, HIST1H2BG, HIST1H2BH, HIST1H2BI, HIST1H2BJ, HIST1H2BK, HIST1H2BL, HIST1H2BM, HIST1H2BN, HIST1H2BO, HIST1H3A, HIST1H3B, HIST1H3C, HIST1H3D, HIST1H3E, HIST1H3F, HIST1H3G, HIST1H3H, HIST1H3I, HIST1H3J, HIST1H4A, HIST1H4B, HIST1H4C, HIST1H4D, HIST1H4E, HIST1H4F, HIST1H4G, HIST1H4H, HIST1H4I, HIST1H4J, HIST1H4K, HIST1H4L, HIST2H2AA3, HIST2H2AA4, HIST2H2AB, HIST2H2AC, HIST2H2BE, HIST2H2BF, HIST2H3A, HIST2H3C, HIST2H3D, HIST2H3PS2, HIST2H4A, HIST2H4B, HIST3H2A, HIST3H2BB, HIST3H3, and HIST4H4.

In some embodiments, the at least one gene comprises at least one mitochondrial gene selected from the group consisting of: MT-ATP6, MT-ATP8, MT-CO1, MT-CO2, MT-CO3, MT-CYB, MT-ND1, MT-ND2, MT-ND3, MT-ND4, MT-ND4L, MT-ND5, MT-ND6, MT-RNR1, MT-RNR2, MT-TA, MT-TC, MT-TD, MT-TE, MT-TF, MT-TG, MT-TH, MT-TI, MT-TK, MT-TL1, MT-TL2, MT-TM, MT-TN, MT-TP, MT-TQ, MT-TR, MT-TS1, MT-TS2, MT-TT, MT-TV, MT-TW, MT-TY, MTRNR2L1, MTRNR2L10, MTRNR2L11, MTRNR2L12, MTRNR2L13, MTRNR2L3, MTRNR2L4, MTRNR2L5, MTRNR2L6, MTRNR2L7, and MTRNR2L8.

In some embodiments, the RNA expression data is characterized by at least 100 bp paired-end reads, and an estimated coverage of at least 50 million paired-end reads.

In some embodiments, the RNA expression data is characterized by at least 100 bp paired-end reads, and an estimated total number of reads of at least 100 million paired-end reads.

In some embodiments, aligning genes in the RNA expression data is performed using a GRCh38 genome assembly.

In some embodiments, annotating the genes in the RNA expression data is based on GENCODE V23 comprehensive annotation (www.gencodegenes.org).

In some embodiments, the removed non-coding transcripts belong to groups selected from the list consisting of: pseudogenes, polymorphic pseudogenes, processed pseudogenes, transcribed processed pseudogenes, unitary pseudogenes, unprocessed pseudogenes, transcribed unitary pseudogenes, constant chain immunoglobulin (IG C) pseudogenes, joining chain immunoglobulin (IG J) pseudogenes, variable chain immunoglobulin (IG V) pseudogenes, transcribed unprocessed pseudogenes, translated unprocessed pseudogenes, joining chain T cell receptor (TR J) pseudogenes, variable chain T cell receptor (TR V) pseudogenes, small nuclear RNAs (snRNA), small nucleolar RNAs (snoRNA), microRNAs (miRNA), ribozymes, ribosomal RNA (rRNA), mitochondrial tRNAs (Mt tRNA), mitochondrial rRNAs (Mt rRNA), small Cajal body-specific RNAs (scaRNA), retained introns, sense intronic RNA, sense overlapping RNA, nonsense-mediated decay RNA, non-stop decay RNA, antisense RNA, long intervening noncoding RNAs (lincRNA), macro long non-coding RNA (macro lncRNA), processed transcripts, 3prime overlapping non-coding RNA (3prime overlapping ncrna), small RNAs (sRNA), miscellaneous RNA (misc RNA), vault RNA (vaultRNA), and TEC RNA.
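As a non-limiting illustration, removal of non-coding transcripts by annotation category may be sketched as follows. The record layout (a list of dictionaries with a "transcript_type" key) and the function name are assumptions of this sketch, and the set below is only an illustrative subset of the categories listed above, written in the style of GENCODE biotype labels.

```python
from typing import Dict, List

# Illustrative subset only; a full filter would cover every listed category.
NON_CODING_TYPES = {
    "pseudogene", "processed_pseudogene", "unprocessed_pseudogene",
    "snRNA", "snoRNA", "miRNA", "rRNA", "Mt_tRNA", "Mt_rRNA",
    "lincRNA", "antisense", "retained_intron",
    "nonsense_mediated_decay", "misc_RNA", "TEC",
}


def drop_non_coding(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only transcripts whose annotated type is not in the exclusion set."""
    return [r for r in records if r["transcript_type"] not in NON_CODING_TYPES]


# Hypothetical annotated transcripts (identifiers are placeholders).
annotated = [
    {"transcript_id": "tx1", "transcript_type": "protein_coding"},
    {"transcript_id": "tx2", "transcript_type": "lincRNA"},
    {"transcript_id": "tx3", "transcript_type": "retained_intron"},
    {"transcript_id": "tx4", "transcript_type": "protein_coding"},
]
coding_only = drop_non_coding(annotated)
```

Filtering before TPM normalization means that the removed categories do not contribute to the per-sample scaling factor.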

In some embodiments, the RNA expression data has been obtained by performing RNA sequencing on one or more samples of a subject's tumor.

In some embodiments, identifying the cancer treatment for the subject using the bias-corrected gene expression data comprises: determining, using the bias-corrected gene expression data, a gene group expression level for each gene group in a set of gene groups, wherein the set of gene groups comprises at least one gene group associated with cancer malignancy, and at least one gene group associated with cancer microenvironment; and identifying the cancer treatment using the determined gene group expression levels. In some embodiments, the method further comprises: administering the cancer treatment to the subject.
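As a non-limiting illustration, determining a gene group expression level for each gene group may be sketched as follows. This passage does not specify how a group-level value is computed; the mean of log2(TPM + 1) over member genes used here is an assumption of this sketch, as are the function name, group names, and gene names.

```python
import math
from typing import Dict, List


def gene_group_levels(tpm: Dict[str, float],
                      gene_groups: Dict[str, List[str]]) -> Dict[str, float]:
    """Summarize each gene group as the mean log2(TPM + 1) of its member genes.

    Genes absent from the expression data are skipped; groups with no
    measured member are omitted from the result.
    """
    levels = {}
    for name, genes in gene_groups.items():
        present = [g for g in genes if g in tpm]
        if not present:
            continue
        levels[name] = sum(math.log2(tpm[g] + 1.0) for g in present) / len(present)
    return levels


# Hypothetical bias-corrected expression values and gene groups.
levels = gene_group_levels(
    {"G1": 3.0, "G2": 15.0, "G3": 0.0},
    {"malignancy_group": ["G1", "G2"], "microenvironment_group": ["G3", "GX"]},
)
```

The resulting per-group values could then be passed to whatever downstream procedure is used to identify the cancer treatment.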

Some embodiments provide for a method comprising: enriching RNA for coding RNA in a sample of extracted RNA from a first tumor sample from a subject having, suspected of having, or at risk of having cancer; performing non-stranded RNA sequencing on a first library of cDNA fragments prepared from the enriched RNA to obtain RNA expression data; converting the RNA expression data to gene expression data in transcripts per kilobase million (TPM); identifying at least one gene that introduces bias in the gene expression data; removing, from the gene expression data, expression data associated with the at least one gene to obtain bias-corrected gene expression data; and identifying a therapy for the subject using the bias-corrected gene expression data. In some embodiments, the method further comprises administering to the subject the identified therapy.

In some embodiments, identifying the therapy for the subject using the bias-corrected gene expression data comprises: determining, using the bias-corrected gene expression data, a plurality of gene group expression levels comprising a gene group expression level for each gene group in a set of gene groups, wherein the set of gene groups comprises at least one gene group associated with cancer malignancy, and at least one gene group associated with cancer microenvironment; and identifying the therapy using the determined plurality of gene group expression levels.

In some embodiments, identifying the at least one gene that introduces bias in the gene expression data comprises identifying at least one gene having an average transcript length at least a threshold amount higher or lower than an average length of transcripts in the gene expression data.

In some embodiments, identifying the at least one gene that introduces bias in the gene expression data comprises identifying at least one gene having at least a threshold variation in average transcript expression level based on transcript expression levels in reference samples.

In some embodiments, identifying the at least one gene comprises identifying one or more genes having a polyA tail that is at least a threshold amount smaller in length compared to an average length of polyA tails of genes from a sample from which the RNA expression data was obtained.

In some embodiments, the at least one gene belongs to a family of genes selected from the group consisting of: histone-encoding genes, mitochondrial genes, interleukin-encoding genes, collagen-encoding genes, B-cell receptor-encoding genes, and T cell receptor-encoding genes.

In some embodiments, the at least one gene comprises at least one histone-encoding gene selected from the group consisting of: HIST1H1A, HIST1H1B, HIST1H1C, HIST1H1D, HIST1H1E, HIST1H1T, HIST1H2AA, HIST1H2AB, HIST1H2AC, HIST1H2AD, HIST1H2AE, HIST1H2AG, HIST1H2AH, HIST1H2AI, HIST1H2AJ, HIST1H2AK, HIST1H2AL, HIST1H2AM, HIST1H2BA, HIST1H2BB, HIST1H2BC, HIST1H2BD, HIST1H2BE, HIST1H2BF, HIST1H2BG, HIST1H2BH, HIST1H2BI, HIST1H2BJ, HIST1H2BK, HIST1H2BL, HIST1H2BM, HIST1H2BN, HIST1H2BO, HIST1H3A, HIST1H3B, HIST1H3C, HIST1H3D, HIST1H3E, HIST1H3F, HIST1H3G, HIST1H3H, HIST1H3I, HIST1H3J, HIST1H4A, HIST1H4B, HIST1H4C, HIST1H4D, HIST1H4E, HIST1H4F, HIST1H4G, HIST1H4H, HIST1H4I, HIST1H4J, HIST1H4K, HIST1H4L, HIST2H2AA3, HIST2H2AA4, HIST2H2AB, HIST2H2AC, HIST2H2BE, HIST2H2BF, HIST2H3A, HIST2H3C, HIST2H3D, HIST2H3PS2, HIST2H4A, HIST2H4B, HIST3H2A, HIST3H2BB, HIST3H3, and HIST4H4.

In some embodiments, the at least one gene comprises at least one mitochondrial gene selected from the group consisting of: MT-ATP6, MT-ATP8, MT-CO1, MT-CO2, MT-CO3, MT-CYB, MT-ND1, MT-ND2, MT-ND3, MT-ND4, MT-ND4L, MT-ND5, MT-ND6, MT-RNR1, MT-RNR2, MT-TA, MT-TC, MT-TD, MT-TE, MT-TF, MT-TG, MT-TH, MT-TI, MT-TK, MT-TL1, MT-TL2, MT-TM, MT-TN, MT-TP, MT-TQ, MT-TR, MT-TS1, MT-TS2, MT-TT, MT-TV, MT-TW, MT-TY, MTRNR2L1, MTRNR2L10, MTRNR2L11, MTRNR2L12, MTRNR2L13, MTRNR2L3, MTRNR2L4, MTRNR2L5, MTRNR2L6, MTRNR2L7, and MTRNR2L8.

In some embodiments, the RNA expression data is characterized by at least 100 bp paired-end reads, and an estimated read depth of at least 50 million paired-end reads.

In some embodiments, the method further comprises: aligning and annotating genes in the RNA expression data with known sequences of the human genome to obtain annotated RNA expression data before identifying at least one gene that introduces bias in the gene expression data, wherein aligning genes in the RNA expression data is performed using a GRCh38 genome assembly, and annotating the genes in the RNA expression data is performed using a GENCODE V23 comprehensive annotation (www.gencodegenes.org).

In some embodiments, the method further comprises: removing non-coding transcripts from the RNA expression data, wherein the removed non-coding transcripts belong to groups selected from the list consisting of: pseudogenes, polymorphic pseudogenes, processed pseudogenes, transcribed processed pseudogenes, unitary pseudogenes, unprocessed pseudogenes, transcribed unitary pseudogenes, constant chain immunoglobulin (IG C) pseudogenes, joining chain immunoglobulin (IG J) pseudogenes, variable chain immunoglobulin (IG V) pseudogenes, transcribed unprocessed pseudogenes, translated unprocessed pseudogenes, joining chain T cell receptor (TR J) pseudogenes, variable chain T cell receptor (TR V) pseudogenes, small nuclear RNAs (snRNA), small nucleolar RNAs (snoRNA), microRNAs (miRNA), ribozymes, ribosomal RNA (rRNA), mitochondrial tRNAs (Mt tRNA), mitochondrial rRNAs (Mt rRNA), small Cajal body-specific RNAs (scaRNA), retained introns, sense intronic RNA, sense overlapping RNA, nonsense-mediated decay RNA, non-stop decay RNA, antisense RNA, long intervening noncoding RNAs (lincRNA), macro long non-coding RNA (macro lncRNA), processed transcripts, 3prime overlapping non-coding RNA (3prime overlapping ncrna), small RNAs (sRNA), miscellaneous RNA (misc RNA), vault RNA (vaultRNA), and TEC RNA.

In some embodiments, the method further comprises, before enriching the RNA for coding RNA: obtaining a first sample of a first tumor from a subject having or suspected of having cancer, and extracting RNA from the first sample of the first tumor to obtain the sample of extracted RNA. In some embodiments, the method further comprises obtaining a second sample of a second tumor from the subject.

In some embodiments, the method further comprises: combining the first and second samples to form a combined tumor sample, wherein extracting the RNA comprises extracting the RNA from the combined tumor sample.

In some embodiments, the method further comprises: extracting RNA from the second sample; and combining the RNA extracted from the second sample with the RNA extracted from the first sample to form combined extracted RNA, wherein enriching the RNA for coding RNA comprises enriching the combined extracted RNA for coding RNA.

In some embodiments, the sample of extracted RNA comprises at least 1 μg of RNA upon RNA extraction.

In some embodiments, the extracted RNA is at least 1000-6000 ng in total mass and has a purity corresponding to a ratio of absorbance at 260 nm to absorbance at 280 nm of at least 2.0.
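As a non-limiting illustration, the mass and purity criteria above may be checked as follows. The function name and parameters are hypothetical, and using 1000 ng (the lower end of the stated range) as the default minimum mass is an assumption of this sketch.

```python
def rna_sample_passes_qc(mass_ng: float, a260: float, a280: float,
                         min_mass_ng: float = 1000.0,
                         min_ratio: float = 2.0) -> bool:
    """Return True when total RNA mass and the A260/A280 ratio both meet
    the configured minimums."""
    if a280 <= 0:
        return False  # ratio is undefined without a positive A280 reading
    return mass_ng >= min_mass_ng and (a260 / a280) >= min_ratio


ok_sample = rna_sample_passes_qc(mass_ng=1500.0, a260=2.1, a280=1.0)
low_mass = rna_sample_passes_qc(mass_ng=800.0, a260=2.1, a280=1.0)
low_purity = rna_sample_passes_qc(mass_ng=1500.0, a260=1.8, a280=1.0)
```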

In some embodiments, enriching the RNA for coding RNA comprises performing polyA enrichment.

Some embodiments provide for a system, comprising: at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: a method, comprising: (a) obtaining nucleic acid data comprising: (i) sequence data comprising at least 5 kilobases (kb) of DNA and/or RNA, the sequence data obtained by sequencing a biological sample of a subject having, suspected of having, or at risk of having a disease; and (ii) asserted information indicating an asserted source and/or an asserted integrity of the sequence data; and (b) validating the nucleic acid data by: (i) processing the sequence data to obtain determined information indicating a determined source and/or a determined integrity of the sequence data; and (ii) determining whether the determined information matches the asserted information.

Some embodiments provide for at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform: a method, comprising: (a) obtaining nucleic acid data comprising: (i) sequence data comprising at least 5 kilobases (kb) of DNA and/or RNA, the sequence data obtained by sequencing a biological sample of a subject having, suspected of having, or at risk of having a disease; and (ii) asserted information indicating an asserted source and/or an asserted integrity of the sequence data; and (b) validating the nucleic acid data by: (i) processing the sequence data to obtain determined information indicating a determined source and/or a determined integrity of the sequence data; and (ii) determining whether the determined information matches the asserted information.

Some embodiments provide for a method, comprising: (a) obtaining nucleic acid data comprising: (i) sequence data comprising at least 5 kilobases (kb) of DNA and/or RNA, the sequence data obtained by sequencing a biological sample of a subject having, suspected of having, or at risk of having a disease; and (ii) asserted information indicating an asserted source and/or an asserted integrity of the sequence data; and (b) validating the nucleic acid data by: (i) processing the sequence data to obtain determined information indicating a determined source and/or a determined integrity of the sequence data; and (ii) determining whether the determined information matches the asserted information.

In some embodiments, the method further comprises: (c) when it is determined that the asserted information matches the determined information: (i) accessing a database of disease features; and (ii) processing the sequence data to determine whether it is indicative of one or more of the disease features; and (d) when it is determined that the asserted information does not match the determined information: (i) indicating to a user that the determined and asserted information do not match; (ii) excluding the sequence data from further analysis; and/or (iii) obtaining additional sequence data and/or other information about the biological sample and/or the subject.
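As a non-limiting illustration, comparing asserted information with determined information and choosing a follow-up action may be sketched as follows. The function name, the feature keys, the action labels, and the use of exact equality as the matching rule are all assumptions of this sketch.

```python
from typing import Any, Dict


def validate_nucleic_acid_data(asserted: Dict[str, Any],
                               determined: Dict[str, Any]) -> Dict[str, Any]:
    """Compare each asserted feature with the value determined from the
    sequence data and summarize the outcome."""
    per_feature = {
        feature: feature in determined and determined[feature] == claimed
        for feature, claimed in asserted.items()
    }
    matched = sum(per_feature.values())
    total = len(per_feature)
    all_match = total > 0 and matched == total
    return {
        "per_feature": per_feature,
        "fraction_matched": matched / total if total else 0.0,
        # Matching data proceeds to analysis; otherwise it is flagged to the
        # user and excluded from further analysis.
        "action": "analyze" if all_match else "flag_and_exclude",
    }


# Hypothetical asserted metadata versus values inferred from the reads.
report = validate_nucleic_acid_data(
    {"nucleic_acid_type": "RNA", "polyA_status": "polyA", "tissue_type": "lung"},
    {"nucleic_acid_type": "RNA", "polyA_status": "polyA", "tissue_type": "breast"},
)
```

The per-feature dictionary can also serve as the basis for a report indicating the extent of the match between determined and asserted features.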

In some embodiments, the asserted information for the sequence data is based on (one, at least two, at least three, between 2 and 10, between 5 and 10 pieces of) information selected from the group consisting of: MHC allele sequence information; nucleic acid type; subject identity; sample identity; tissue type from which the sample was obtained; tumor type from which the sample was obtained; sequencing platform used to generate the sequence data; sequence integrity; polyA status of an RNA sample (e.g., indicating whether the RNA sample was polyA enriched); total sequence coverage; exon coverage; chromosomal coverage; ratio of expression levels of nucleic acids encoding two or more subunits of the same protein; contamination; single nucleotide polymorphisms (SNPs); complexity; and/or guanine (G) and cytosine (C) percentage (%).

In some embodiments, the determined information for the sequence data is based on (one, at least two, at least three, between 2 and 10, between 5 and 10 pieces of) information selected from the group consisting of: MHC allele sequence information; nucleic acid type; subject identity; sample identity; tissue type from which the sample was obtained; tumor type from which the sample was obtained; sequencing platform used to generate the sequence data; sequence integrity; polyA status of an RNA sample (e.g., indicating whether the RNA sample was polyA enriched); total sequence coverage; exon coverage; chromosomal coverage; ratio of expression levels of nucleic acids encoding two or more subunits of the same protein; contamination; single nucleotide polymorphisms (SNPs); complexity; and/or guanine (G) and cytosine (C) percentage (%).

In some embodiments, the disease is cancer. In some embodiments, the subject is human.

In some embodiments, the source of the sequence data is a subject, a tissue type, a tumor type, an RNA sequence type, or a DNA sequence type.

In some embodiments, the subject from which the sequence data is obtained is evaluated by determining one or more MHC sequences, for example, by determining MHC sequences for six MHC loci.

In some embodiments, the source of one or more nucleic acid sequence data sets is evaluated by determining a SNP concordance for the nucleic acid sequence data sets.
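As a non-limiting illustration, a SNP concordance between two sequence data sets may be computed as follows. Defining concordance as the fraction of SNP sites genotyped in both data sets at which the genotype calls agree is an assumption of this sketch, as are the function name and the placeholder site identifiers.

```python
from typing import Dict, Optional


def snp_concordance(calls_a: Dict[str, str], calls_b: Dict[str, str]) -> Optional[float]:
    """Fraction of shared SNP sites with identical genotype calls.

    Returns None when the two data sets share no genotyped site.
    """
    shared = set(calls_a) & set(calls_b)
    if not shared:
        return None
    agreeing = sum(1 for site in shared if calls_a[site] == calls_b[site])
    return agreeing / len(shared)


# Hypothetical genotype calls, e.g. from tumor RNA-seq and matched normal WES.
concordance = snp_concordance(
    {"site1": "AA", "site2": "AG", "site3": "GG", "site4": "CT"},
    {"site1": "AA", "site2": "AG", "site3": "AG", "site5": "TT"},
)
```

A concordance well below that expected for two samples from the same subject would suggest that the asserted source of one of the data sets is incorrect.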

In some embodiments, the integrity of the sequence data is evaluated by determining exon coverage, one or more ratios of protein subunit encoding nucleic acids, and/or gene coverage of the sequence data.

In some embodiments, the integrity of RNA sequence data is evaluated by determining coverage of one or more genes in the RNA sequence data.

In some embodiments, the integrity of RNA sequence data is evaluated by determining a relative coverage of two or more exons for at least one gene in the RNA sequence data.
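As a non-limiting illustration, a relative coverage of the exons of one gene may be computed as follows. Normalizing each exon's mean coverage to the best-covered exon of the same gene is an assumption of this sketch, as is the function name.

```python
from typing import List


def relative_exon_coverage(exon_coverage: List[float]) -> List[float]:
    """Scale per-exon mean coverage (exons in transcript order) so that the
    best-covered exon has a value of 1.0."""
    if not exon_coverage:
        return []
    highest = max(exon_coverage)
    if highest == 0:
        return [0.0] * len(exon_coverage)
    return [cov / highest for cov in exon_coverage]


# Hypothetical mean read depth for three exons, listed 5' to 3'.
profile = relative_exon_coverage([10.0, 20.0, 40.0])
```

A profile that rises steadily toward the 3' exon, as in this toy input, is one pattern that might be examined when assessing RNA integrity.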

In some embodiments, the integrity of RNA sequence data is evaluated by determining an expression ratio of two known reference genes in the RNA sequence data.

In some embodiments, the method further comprises determining a level of nucleic acid degradation, contamination, and/or GC content.
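As a non-limiting illustration, GC content may be computed from read sequences as follows. The function name is hypothetical, and ignoring ambiguous bases such as N when computing the percentage is an assumption of this sketch.

```python
from typing import Iterable, Optional


def gc_percent(sequences: Iterable[str]) -> Optional[float]:
    """Percentage of G and C among unambiguous bases (A, C, G, T).

    Returns None when no unambiguous base is present.
    """
    gc = 0
    total = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        total += sum(seq.count(base) for base in "ACGT")
    return 100.0 * gc / total if total else None


balanced = gc_percent(["GGCC", "aatt"])
with_ambiguous = gc_percent(["ACGN"])
```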

In some embodiments, determining whether RNA sequence data is polyA RNA sequence data or total RNA sequence data comprises determining the expression level of one or more mitochondrial and/or histone genes in the RNA sequence data.
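As a non-limiting illustration, the determination above may be sketched as a comparison of the fraction of expression attributed to marker genes against a threshold. The function names, the labels, the threshold values, and the direction of the rule (a larger histone-gene fraction suggesting total RNA) are assumptions of this sketch; a suitable threshold would need to be calibrated on reference data.

```python
from typing import Dict, Set


def marker_fraction(tpm: Dict[str, float], marker_genes: Set[str]) -> float:
    """Fraction of total expression contributed by the marker genes."""
    total = sum(tpm.values())
    if total == 0:
        return 0.0
    return sum(value for gene, value in tpm.items() if gene in marker_genes) / total


def guess_library_type(tpm: Dict[str, float], histone_genes: Set[str],
                       threshold: float) -> str:
    """Label the data "total" when histone genes exceed the (hypothetical)
    threshold fraction, otherwise "polyA"."""
    return "total" if marker_fraction(tpm, histone_genes) >= threshold else "polyA"


# Toy expression values; ACTB stands in for all non-marker genes.
toy_tpm = {"HIST1H4A": 50.0, "ACTB": 950.0}
label_low_threshold = guess_library_type(toy_tpm, {"HIST1H4A"}, threshold=0.01)
label_high_threshold = guess_library_type(toy_tpm, {"HIST1H4A"}, threshold=0.10)
```

The determined label can then be compared with the asserted polyA status of the sample.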

In some embodiments, the sequencing platform that was used for generating WES sequence data is identified by determining a percent (%) variance for one or more reference genes in the WES sequence data.

In some embodiments, the method further comprises generating a report that indicates an extent of a match between one or more features that are determined from the sequence data and one or more corresponding asserted features in the asserted information.

EQUIVALENTS AND SCOPE

All of the features described in this specification may be combined in any combination. Each feature described in this specification may be replaced by an alternative feature serving the same, equivalent, or similar purpose. Thus, unless expressly stated otherwise, each feature described is only an example of a generic series of equivalent or similar features.

From the above description, one skilled in the art can easily ascertain the essential characteristics of the present disclosure, and without departing from the spirit and scope thereof, can make various changes and modifications of the disclosure to adapt it to various usages and conditions. Thus, other embodiments are also within the claims.

The terms “program” or “software” are used herein in a generic sense to refer to any type of computer code or set of processor-executable instructions that can be employed to program a computer or other processor (physical or virtual) to implement various aspects of embodiments as described above. Additionally, according to one aspect, one or more computer programs that when executed perform methods of the technology described herein need not reside on a single computer or processor, but may be distributed in a modular fashion among different computers or processors to implement various aspects of the technology described herein.

Processor-executable instructions may be in many forms, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Typically, the functionality of the program modules may be combined or distributed.

Also, data structures may be stored in one or more non-transitory computer-readable storage media in any suitable form. For simplicity of illustration, data structures may be shown to have fields that are related through location in the data structure. Such relationships may likewise be achieved by assigning storage for the fields with locations in a non-transitory computer-readable medium that convey relationship between the fields. However, any suitable mechanism may be used to establish relationships among information in fields of a data structure, including through the use of pointers, tags or other mechanisms that establish relationships among data elements.

Various inventive concepts may be embodied as one or more processes, of which examples have been provided. The acts performed as part of each process may be ordered in any suitable way. Thus, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.

As used herein in the specification and in the claims, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified. Thus, for example, “at least one of A and B” (or, equivalently, “at least one of A or B,” or, equivalently “at least one of A and/or B”) can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.

The phrase “and/or,” as used herein in the specification and in the claims, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with “and/or” should be construed in the same fashion, i.e., “one or more” of the elements so conjoined. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified. Thus, as an example, a reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.

In the claims, articles such as “a,” “an,” and “the” may mean one or more than one unless indicated to the contrary or otherwise evident from the context. Claims or descriptions that include “or” between one or more members of a group are considered satisfied if one, more than one, or all of the group members are present in, employed in, or otherwise relevant to a given product or process unless indicated to the contrary or otherwise evident from the context. The disclosure includes embodiments in which exactly one member of the group is present in, employed in, or otherwise relevant to a given product or process. The disclosure includes embodiments in which more than one, or all of the group members are present in, employed in, or otherwise relevant to a given product or process.

Furthermore, the described methods and systems encompass all variations, combinations, and permutations in which one or more limitations, elements, clauses, and descriptive terms from one or more of the listed claims are introduced into another claim. For example, any claim that is dependent on another claim can be modified to include one or more limitations found in any other claim that is dependent on the same base claim. Where elements are presented as lists, e.g., in Markush group format, each subgroup of the elements is also described, and any element(s) can be removed from the group. It should be understood that, in general, where the systems and methods described herein (or aspects thereof) are referred to as comprising particular elements and/or features, certain embodiments of the systems and methods or aspects of the same consist, or consist essentially of, such elements and/or features. For purposes of simplicity, those embodiments have not been specifically set forth in haec verba herein.

It is also noted that the terms “including,” “comprising,” “having,” “containing,” and “involving” are intended to be open and permit the inclusion of additional elements or steps. Where ranges are given, endpoints are included. Furthermore, unless otherwise indicated or otherwise evident from the context and understanding of one of ordinary skill in the art, values that are expressed as ranges can assume any specific value or sub-range within the stated ranges in different embodiments of the described systems and methods, to the tenth of the unit of the lower limit of the range, unless the context clearly dictates otherwise.

Use of ordinal terms such as “first,” “second,” “third,” etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed. Such terms are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term).

Additionally, as used herein the terms “patient” and “subject” may be used interchangeably. Such terms may include, but are not limited to, human subjects or patients. Such terms may also include non-human primates or other animals.

The terms “approximately”, “substantially,” and “about” may be used to mean within ±20% of a target value in some embodiments, within ±10% of a target value in some embodiments, within ±5% of a target value in some embodiments, and within ±2% of a target value in some embodiments. The terms “approximately” and “about” may include the target value.

This application refers to various issued patents, published patent applications, journal articles, and other publications, all of which are incorporated herein by reference. If there is a conflict between any of the incorporated references and the instant specification, the specification shall control. In addition, any particular embodiment of the present disclosure that falls within the prior art may be explicitly excluded from any one or more of the claims. Because such embodiments are deemed to be known to one of ordinary skill in the art, they may be excluded even if the exclusion is not set forth explicitly herein. Any particular embodiment of the systems and methods described herein can be excluded from any claim, for any reason, whether or not related to the existence of prior art.

Those skilled in the art will recognize or be able to ascertain using no more than routine experimentation many equivalents to the specific embodiments described herein. The scope of the present embodiments described herein is not intended to be limited to the above Description, but rather is as set forth in the appended claims. Those of ordinary skill in the art will appreciate that various changes and modifications to this description may be made without departing from the spirit or scope of the present disclosure, as defined in the following claims.

What is claimed is:
 1. A method, comprising: obtaining a first biological sample of a first tumor, the first biological sample previously obtained from a subject having, suspected of having or at risk of having cancer; extracting RNA from the first biological sample of the first tumor to obtain extracted RNA; enriching the extracted RNA for coding RNA to obtain enriched RNA; sequencing, using at least one sequencing platform, the enriched RNA to obtain RNA expression data comprising at least 5 kilobases (kb); using at least one computer hardware processor to perform: obtaining the RNA expression data using the at least one sequencing platform; converting the RNA expression data to gene expression data; determining bias-corrected gene expression data from the gene expression data at least in part by removing, from the gene expression data, expression data for at least one gene that introduces bias in the gene expression data; and identifying a cancer treatment for the subject using the bias-corrected gene expression data.
 2. The method of claim 1, further comprising: administering the identified cancer treatment to the subject.
 3. The method of claim 1, wherein enriching the RNA for coding RNA comprises performing polyA enrichment.
 4. The method of claim 1, wherein the at least one gene that introduces bias in the gene expression data comprises: a gene having an average transcript length that is higher or lower than an average length of transcripts in the gene expression data; a gene having at least a threshold variation in average transcript expression level based on transcript expression levels in reference samples; and/or a gene that has a polyA tail that is at least a threshold amount smaller in length compared to an average length of polyA tails of genes from: the first biological sample from which the RNA expression data was obtained and/or a reference sample.
 5. The method of claim 1, wherein the at least one gene that introduces bias in the gene expression data belongs to a family of genes selected from the group consisting of: histone-encoding genes, mitochondrial genes, interleukin-encoding genes, collagen-encoding genes, B-cell receptor-encoding genes, and T cell receptor-encoding genes.
 6. The method of claim 5, wherein the at least one gene comprises at least one histone-encoding gene selected from the group consisting of: HIST1H1A, HIST1H1B, HIST1H1C, HIST1H1D, HIST1H1E, HIST1H1T, HIST1H2AA, HIST1H2AB, HIST1H2AC, HIST1H2AD, HIST1H2AE, HIST1H2AG, HIST1H2AH, HIST1H2AI, HIST1H2AJ, HIST1H2AK, HIST1H2AL, HIST1H2AM, HIST1H2BA, HIST1H2BB, HIST1H2BC, HIST1H2BD, HIST1H2BE, HIST1H2BF, HIST1H2BG, HIST1H2BH, HIST1H2BI, HIST1H2BJ, HIST1H2BK, HIST1H2BL, HIST1H2BM, HIST1H2BN, HIST1H2BO, HIST1H3A, HIST1H3B, HIST1H3C, HIST1H3D, HIST1H3E, HIST1H3F, HIST1H3G, HIST1H3H, HIST1H3I, HIST1H3J, HIST1H4A, HIST1H4B, HIST1H4C, HIST1H4D, HIST1H4E, HIST1H4F, HIST1H4G, HIST1H4H, HIST1H4I, HIST1H4J, HIST1H4K, HIST1H4L, HIST2H2AA3, HIST2H2AA4, HIST2H2AB, HIST2H2AC, HIST2H2BE, HIST2H2BF, HIST2H3A, HIST2H3C, HIST2H3D, HIST2H3PS2, HIST2H4A, HIST2H4B, HIST3H2A, HIST3H2BB, HIST3H3, and HIST4H4.
 7. The method of claim 5, wherein the at least one gene comprises at least one mitochondrial gene selected from the group consisting of: MT-ATP6, MT-ATP8, MT-CO1, MT-CO2, MT-CO3, MT-CYB, MT-ND1, MT-ND2, MT-ND3, MT-ND4, MT-ND4L, MT-ND5, MT-ND6, MT-RNR1, MT-RNR2, MT-TA, MT-TC, MT-TD, MT-TE, MT-TF, MT-TG, MT-TH, MT-TI, MT-TK, MT-TL1, MT-TL2, MT-TM, MT-TN, MT-TP, MT-TQ, MT-TR, MT-TS1, MT-TS2, MT-TT, MT-TV, MT-TW, MT-TY, MTRNR2L1, MTRNR2L10, MTRNR2L11, MTRNR2L12, MTRNR2L13, MTRNR2L3, MTRNR2L4, MTRNR2L5, MTRNR2L6, MTRNR2L7, and MTRNR2L8.
 8. The method of claim 1, wherein determining the bias-corrected gene expression data further comprises: after removing the expression data for the at least one gene that introduces bias in the gene expression data, renormalizing the gene expression data.
 9. The method of claim 1, wherein converting the RNA expression data to gene expression data comprises: removing non-coding transcripts from the RNA expression data to obtain filtered RNA expression data; and after removing the non-coding transcripts, normalizing the filtered RNA expression data to obtain gene expression data in transcripts per million (TPM).
 10. The method of claim 1, wherein removing the non-coding transcripts from the RNA expression data comprises removing non-coding transcripts that belong to groups selected from the list consisting of: pseudogenes, polymorphic pseudogenes, processed pseudogenes, transcribed processed pseudogenes, unitary pseudogenes, unprocessed pseudogenes, transcribed unitary pseudogenes, IG C pseudogenes, IG J pseudogenes, IG V pseudogenes, transcribed unprocessed pseudogenes, translated unprocessed pseudogenes, TR J pseudogenes, TR V pseudogenes, snRNA, snoRNA, miRNA, ribozymes, rRNA, Mt tRNA, Mt rRNA, scaRNA, retained introns, sense intronics, sense overlapping RNA, nonsense mediated decay RNA, non stop decay RNA, antisense RNA, lincRNA, macro lncRNA, processed transcripts, 3prime overlapping ncrna, sRNA, misc RNA, vault RNA, and TEC.
 11. The method of claim 1, further comprising: prior to performing the removal of the non-coding transcripts, aligning the RNA expression data to a reference; and annotating the RNA expression data.
 12. The method of claim 1, wherein the RNA expression data comprises at least 25 million paired-end reads.
 13. The method of claim 12, wherein the RNA expression data comprises at least 50 million paired-end reads, with an average read length of at least 100 bp.
 14. The method of claim 1, wherein identifying the cancer treatment for the subject using the bias-corrected gene expression data comprises: determining, using the bias-corrected gene expression data, a plurality of gene group expression levels, the plurality of gene group expression levels comprising a gene group expression level for each gene group in a set of gene groups, wherein the set of gene groups comprises at least one gene group associated with cancer malignancy, and at least one gene group associated with cancer microenvironment; and identifying the cancer treatment using the determined gene group expression levels.
 15. The method of claim 14, wherein the cancer treatment is selected from the group consisting of a radiation therapy, a surgical therapy, a chemotherapy, and an immunotherapy.
 16. The method of claim 1, further comprising obtaining a second biological sample of a second tumor, the second biological sample previously obtained from the subject.
 17. The method of claim 16, further comprising: combining the first biological sample and the second biological sample to form a combined tumor sample, wherein extracting the RNA comprises extracting the RNA from the combined tumor sample.
 18. The method of claim 16, further comprising: extracting RNA from the second biological sample; and combining the RNA extracted from the second biological sample with the RNA extracted from the first biological sample to form combined extracted RNA, wherein enriching the RNA for coding RNA comprises enriching the combined extracted RNA for coding RNA.
 19. The method of claim 1, wherein the extracted RNA comprises at least 1 μg of RNA upon RNA extraction.
 20. The method of claim 19, wherein the extracted RNA is at least 1000-6000 ng in total mass, and has a purity corresponding to a ratio of absorbance at 260 nm to absorbance at 280 nm of at least 2.0.
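As a minimal sketch, the quantity and purity conditions of claims 19 and 20 can be expressed as a simple acceptance check. The function name and its parameters are hypothetical. The default thresholds use the lower bound of 1000 ng (1 μg) and the A260/A280 ratio of 2.0 recited above.

```python
def rna_meets_input_criteria(mass_ng, a260, a280, min_mass_ng=1000.0, min_ratio=2.0):
    """Return True if extracted RNA meets the mass and purity thresholds.

    mass_ng: total RNA mass in nanograms.
    a260, a280: absorbance readings at 260 nm and 280 nm.
    """
    if a280 <= 0:
        raise ValueError("absorbance at 280 nm must be positive")
    purity_ratio = a260 / a280
    return mass_ng >= min_mass_ng and purity_ratio >= min_ratio


# Hypothetical measurements.
ok = rna_meets_input_criteria(mass_ng=1500.0, a260=1.0, a280=0.5)       # ratio 2.0
low_purity = rna_meets_input_criteria(mass_ng=1500.0, a260=0.9, a280=0.5)  # ratio 1.8
low_mass = rna_meets_input_criteria(mass_ng=800.0, a260=1.0, a280=0.5)
```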
 21. The method of claim 1, further comprising performing quality control assessment on the RNA expression data at least in part by: obtaining asserted information indicating an asserted source and/or an asserted integrity of the RNA expression data; processing the RNA expression data to obtain determined information indicating a determined source and/or a determined integrity of the RNA expression data; and determining whether the determined information matches the asserted information.
 22. The method of claim 1, wherein processing the RNA expression data comprises processing the RNA expression data to determine: a tissue type of the first biological sample; a tumor type of the first biological sample; and/or guanine (G) and/or cytosine (C) percentage (%).
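The comparison of asserted and determined information in claims 21 and 22 can be sketched as follows. This is an illustrative sketch only: the field names (`tissue_type`, `tumor_type`, `gc_percent`), the 5-point GC tolerance, and the exclusion of ambiguous bases from the GC denominator are assumptions, and how the tissue or tumor type is determined from the data is outside the scope of the sketch.

```python
def gc_percent(reads):
    """Percentage of G and C among called A/C/G/T bases in the given reads."""
    gc = 0
    total = 0
    for read in reads:
        seq = read.upper()
        gc += seq.count("G") + seq.count("C")
        # Ambiguous bases such as N are excluded from the denominator (assumption).
        total += sum(seq.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("no called bases in reads")
    return 100.0 * gc / total


def find_mismatches(asserted, determined, gc_tolerance=5.0):
    """Return the names of asserted fields that the determined values do not match."""
    mismatches = []
    for key in ("tissue_type", "tumor_type"):
        if key in asserted and asserted[key] != determined.get(key):
            mismatches.append(key)
    if "gc_percent" in asserted:
        measured = determined.get("gc_percent", float("inf"))
        if abs(asserted["gc_percent"] - measured) > gc_tolerance:
            mismatches.append("gc_percent")
    return mismatches


# Hypothetical sample: the asserted tissue type disagrees with the determined one.
measured_gc = gc_percent(["GGCC", "ATAT", "NNNN"])
asserted = {"tissue_type": "lung", "gc_percent": 48.0}
determined = {"tissue_type": "breast", "gc_percent": measured_gc}
problems = find_mismatches(asserted, determined)
```

An empty result would indicate that the determined information matches the asserted information. A non-empty result lists the fields to review before the data are used further.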
 23. A system for identifying a cancer treatment for a subject having, suspected of having, or at risk of having cancer, the system comprising: at least one sequencing platform configured to generate gene expression data from enriched RNA obtained from a first biological sample previously obtained from the subject, wherein the enriched RNA was obtained by: (i) extracting RNA from the first biological sample of the first tumor to obtain extracted RNA; and (ii) enriching the extracted RNA for coding RNA to obtain enriched RNA, wherein the RNA expression data comprises at least 5 kilobases (kb); at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining the RNA expression data using the at least one sequencing platform; converting the RNA expression data to gene expression data; determining bias-corrected gene expression data from the gene expression data at least in part by removing, from the gene expression data, expression data for at least one gene that introduces bias in the gene expression data; and identifying a cancer treatment for the subject using the bias-corrected gene expression data.
 24. A system for identifying a cancer treatment for a subject having, suspected of having, or at risk of having cancer, the system comprising: at least one computer hardware processor; and at least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by the at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining RNA expression data from at least one sequencing platform, the RNA expression data comprising at least 5 kilobases (5 kb), wherein the RNA expression data was obtained, from a first biological sample of a first tumor previously obtained from the subject, at least in part by: (i) extracting RNA from the first biological sample of the first tumor to obtain extracted RNA; and (ii) enriching the extracted RNA for coding RNA to obtain enriched RNA; converting the RNA expression data to gene expression data; determining bias-corrected gene expression data from the gene expression data at least in part by removing, from the gene expression data, expression data for at least one gene that introduces bias in the gene expression data; and identifying a cancer treatment for the subject using the bias-corrected gene expression data.
 25. At least one non-transitory computer-readable storage medium storing processor-executable instructions that, when executed by at least one computer hardware processor, cause the at least one computer hardware processor to perform: obtaining RNA expression data from at least one sequencing platform, the RNA expression data comprising at least 5 kilobases (5 kb), wherein the RNA expression data was obtained, from a first biological sample of a first tumor previously obtained from a subject having, suspected of having, or at risk of having cancer, at least in part by: (i) extracting RNA from the first biological sample of the first tumor to obtain extracted RNA; and (ii) enriching the extracted RNA for coding RNA to obtain enriched RNA; converting the RNA expression data to gene expression data; determining bias-corrected gene expression data from the gene expression data at least in part by removing, from the gene expression data, expression data for at least one gene that introduces bias in the gene expression data; and identifying a cancer treatment for the subject using the bias-corrected gene expression data.